Abstract — We consider the problem of joint universal
variable-rate lossy coding and identification for parametric
classes of stationary $\beta$-mixing sources with general (Polish)
alphabets. Compression performance is measured in terms
of Lagrangians, while identification performance is measured
by the variational distance between the true source and the
estimated source. Provided that the sources are mixing at
a sufficiently fast rate and satisfy certain smoothness and
Vapnik-Chervonenkis learnability conditions, it is shown that,
for bounded metric distortions, there exist universal schemes
for joint lossy compression and identification whose Lagrangian
redundancies converge to zero as $\sqrt{V_n \log n / n}$ as the block
length $n$ tends to infinity, where $V_n$ is the Vapnik-Chervonenkis
dimension of a certain class of decision regions defined by the
$n$-dimensional marginal distributions of the sources; furthermore,
for each $n$, the decoder can identify the $n$-dimensional marginal
of the active source up to a ball of radius $O(\sqrt{V_n \log n / n})$ in
variational distance, eventually with probability one. The results
are supplemented by several examples of parametric sources
satisfying the regularity conditions.

Index Terms — Learning, minimum-distance density estimation, 
two-stage codes, universal vector quantization, Vapnik- 
Chervonenkis dimension. 



I. Introduction 

It is well known that lossless source coding and statis- 
tical modeling are complementary objectives. This fact is 
captured by the Kraft inequality (see Section 5.2 in Cover 
and Thomas [1]), which provides a correspondence between 
uniquely decodable codes and probability distributions on a 
discrete alphabet. If one has full knowledge of the source 
statistics, then one can design an optimal lossless code for the 
source, and vice versa. However, in practice it is unreasonable 
to expect that the source statistics are known precisely, so one 
has to design universal schemes that perform asymptotically 
optimally within a given class of sources. In universal coding, 
too, as Rissanen has shown in [2], [3], the coding and modeling 
objectives can be accomplished jointly: given a sufficiently 
regular parametric family of discrete-alphabet sources, the en- 
coder can acquire the source statistics via maximum-likelihood 
estimation on a sufficiently long data sequence and use this 
knowledge to select an appropriate coding scheme. Even in 
nonparametric settings (e.g., the class of all stationary ergodic
discrete-alphabet sources), universal schemes such as Ziv- 
Lempel [4] amount to constructing a probabilistic model for 
the source. In the reverse direction, Kieffer [5] and Merhav 
[6], among others, have addressed the problem of statistical 
modeling (parameter estimation or model identification) via 
universal lossless coding. 

Once we consider lossy coding, though, the relationship 
between coding and modeling is no longer so simple. On 
the one hand, having full knowledge of the source statis- 
tics is certainly helpful for designing optimal rate-distortion 
codebooks. On the other hand, apart from some special cases 
(e.g., for i.i.d. Bernoulli sources and the Hamming distortion 
measure or for i.i.d. Gaussian sources and the squared-error 
distortion measure), it is not at all clear how to extract a 
reliable statistical model of the source from its reproduction 
via a rate-distortion code (although, as shown recently by 
Weissman and Ordentlich [7], the joint empirical distribution 
of the source realization and the corresponding codeword of 
a "good" rate-distortion code converges to the distribution 
solving the rate-distortion problem for the source). This is 
not a problem when the emphasis is on compression, but 
there are situations in which one would like to compress 
the source and identify its statistics at the same time. For 
instance, in indirect adaptive control (see, e.g., Chapter 7 of 
Tao [8]) the parameters of the plant (the controlled system) 
are estimated on the basis of observation, and the controller 
is modified accordingly. Consider the discrete-time stochastic 
setting, in which the plant state sequence is a random process 
whose statistics are governed by a finite set of parameters. 
Suppose that the controller is geographically separated from 
the plant and connected to it via a noiseless digital channel 
whose capacity is R bits per use. Then, given the time horizon 
T, the objective is to design an encoder and a decoder for 
the controller to obtain reliable estimates of both the plant 
parameters and the plant state sequence from the $2^{TR}$ possible
outputs of the decoder. 

To state the problem in general terms, consider an information
source emitting a sequence $X = \{X_i\}_{i \in \mathbb{Z}}$ of random
variables taking values in an alphabet $\mathcal{X}$. Suppose that the
process distribution of $X$ is not specified completely, but it is
known to be a member of some parametric class $\{P_\theta : \theta \in \Lambda\}$.
We wish to answer the following two questions: 

1) Is the class $\{P_\theta : \theta \in \Lambda\}$ universally encodable with
respect to a given single-letter distortion measure $\rho$, by
codes with a given structure (e.g., all fixed-rate block
codes with a given per-letter rate, all variable-rate block
codes, etc.)? In other words, does there exist a scheme
that is asymptotically optimal for each $P_\theta$, $\theta \in \Lambda$?
2) If the answer to Question 1) is positive, can the codes 
be constructed in such a way that the decoder can not 
only reconstruct the source, but also identify its process 
distribution $P_\theta$, in an asymptotically optimal fashion?

In previous work [9], [10], we have addressed these two 
questions in the context of fixed-rate lossy block coding 
of stationary memoryless (i.i.d.) continuous-alphabet sources 
with parameter space $\Lambda$ a bounded subset of $\mathbb{R}^k$ for some
finite k. We have shown that, under appropriate regularity 
conditions on the distortion measure and on the source models, 
there exist joint universal schemes for lossy coding and source 
identification whose redundancies (that is, the gap between 
the actual performance and the theoretical optimum given by 
the Shannon distortion-rate function) and source estimation 
fidelity both converge to zero as $O(\sqrt{\log n/n})$, as the block
length n tends to infinity. The code operates by coding 
each block with the code matched to the source with the 
parameters estimated from the preceding block. Comparing 
this convergence rate to the log n/n convergence rate, which is 
optimal for redundancies of fixed-rate lossy block codes [11], 
we see that there is, in general, a price to be paid for doing 
compression and identification simultaneously. Furthermore, 
the constant hidden in the $O(\cdot)$ notation increases with the
"richness" of the model class $\{P_\theta : \theta \in \Lambda\}$, as measured by
the Vapnik-Chervonenkis (VC) dimension [12] of a certain 
class of measurable subsets of the source alphabet associated 
with the sources. 

The main limitation of the results of [9], [10] is the i.i.d. 
assumption, which is rather restrictive as it excludes many 
practically relevant model classes (e.g., autoregressive sources, 
or Markov and hidden Markov processes). Furthermore, the 
assumption that the parameter space $\Lambda$ is bounded may not
always hold, at least in the sense that we may not know the
diameter of $\Lambda$ a priori. In this paper we relax both of these
assumptions and study the existence and the performance of
universal schemes for joint lossy coding and identification
of stationary sources satisfying a mixing condition, when
the sources are assumed to belong to a parametric model
class $\{P_\theta : \theta \in \Lambda\}$, $\Lambda$ being an open subset of $\mathbb{R}^k$ for
some finite $k$. Because the parameter space is not bounded,
we have to use variable-rate codes with countably infinite 
codebooks, and the performance of the code is assessed by 
a composite Lagrangian functional [13] which captures the 
trade-off between the expected distortion and the expected 
rate of the code. Our result is that, under certain regularity 
conditions on the distortion measure and on the model class, 
there exist universal schemes for joint lossy source coding and 
identification such that, as the block length n tends to infinity, 
the gap between the actual Lagrangian performance and the 
optimal Lagrangian performance achievable by variable-rate 
codes at that block length, as well as the source estimation 
fidelity at the decoder, converge to zero as $O(\sqrt{V_n \log n/n})$,
where $V_n$ is the VC dimension of a certain class of decision
regions induced by the collection $\{P_\theta^n : \theta \in \Lambda\}$ of the
$n$-dimensional marginals of the source process distributions.

This result shows very clearly that the price to be paid for 
universality, in terms of both compression and identification, 
grows with the richness of the underlying model class, as 
captured by the VC dimension sequence $\{V_n\}$. The richer
the model class, the harder it is to learn, which affects the 
compression performance of our scheme because we use the 
source parameters learned from past data to decide how to 
encode the current block. Furthermore, comparing the rate at 
which the Lagrangian redundancy decays to zero under our 
scheme with the $O(\log n/n)$ result of Chou, Effros and Gray
[14], whose universal scheme is not aimed at identification, we
immediately see that, in satisfying the twin objectives
of compression and modeling, we inevitably sacrifice some
compression performance. 

The paper is organized as follows. Section II introduces
notation and basic concepts related to sources, codes and
Vapnik-Chervonenkis classes. Section III lists and discusses
the regularity conditions that have to be satisfied by the source
model class, and contains the statement of our result. The
result is proved in Section IV. Next, in Section V, we give
three examples of parametric source families (namely, i.i.d.
Gaussian sources, Gaussian autoregressive sources and hidden
Markov processes) which fit the framework of this paper under
suitable regularity conditions. We conclude in Section VI and
outline directions for future research. Finally, the Appendix 
contains some technical results on Lagrange-optimal variable- 
rate quantizers. 

II. Preliminaries 

A. Sources 

In this paper, a source is a discrete-time stationary ergodic
random process $X = \{X_i\}_{i \in \mathbb{Z}}$ with alphabet $\mathcal{X}$. We assume
that $\mathcal{X}$ is a Polish space (i.e., a complete separable metric
space¹) and equip $\mathcal{X}$ with its Borel $\sigma$-field. For any pair of
indices $i, j \in \mathbb{Z}$ with $i \le j$, let $X_i^j$ denote the segment
$(X_i, X_{i+1}, \ldots, X_j)$ of $X$. If $P$ is the process distribution
of $X$, then we let $E_P\{\cdot\}$ denote expectation with respect
to $P$, and let $P^n$ denote the marginal distribution of $X_1^n$.
Whenever $P$ carries a subscript, e.g., $P = P_\theta$, we write
$E_\theta\{\cdot\}$ instead. We assume that there exists a fixed $\sigma$-finite
measure $\mu$ on $\mathcal{X}$, such that the $n$-dimensional marginal of
any process distribution of interest is absolutely continuous
with respect to the product measure $\mu^n$, for all $n \ge 1$. We
denote the corresponding densities $dP^n/d\mu^n$ by $p^n$. To avoid
notational clutter, we omit the superscript $n$ from $\mu^n$, $P^n$
and $p^n$ whenever it is clear from the argument, as in $d\mu(x^n)$,
$dP(x^n)$ or $p(x^n)$.

Given two probability measures $P$, $Q$ on a measurable space
$(Z, \mathcal{A})$, the variational distance between them is defined by

$$d(P, Q) = \sup \sum_i |P(A_i) - Q(A_i)|,$$

where the supremum is over all finite $\mathcal{A}$-measurable partitions
$\{A_i\}$ of $Z$ (see, e.g., Section 5.2 of Gray [15]). If $p$ and $q$ are
the densities of $P$ and $Q$, respectively, with respect to a
dominating measure $\nu$, then we can write

$$d(P, Q) = \int_Z |p(z) - q(z)| \, d\nu(z).$$

¹The canonical example is the Euclidean space $\mathbb{R}^d$ for some $d < \infty$.

A useful property of the variational distance is that, for any
measurable function $f : Z \to [0, 1]$, $|E_P f - E_Q f| \le d(P, Q)$.
When $P$ and $Q$ are $n$-dimensional marginals of $P_\theta$ and $P_{\theta'}$,
respectively, i.e., $P = P_\theta^n$ and $Q = P_{\theta'}^n$, we write $d_n(\theta, \theta')$ for
$d(P_\theta^n, P_{\theta'}^n)$. If $\mathcal{A}'$ is a $\sigma$-subfield of $\mathcal{A}$, we define the variational
distance $d(P, Q; \mathcal{A}')$ between $P$ and $Q$ with respect to $\mathcal{A}'$ by

$$d(P, Q; \mathcal{A}') = \sup \sum_i |P(A_i) - Q(A_i)|,$$

where the supremum is over all finite $\mathcal{A}'$-measurable partitions
of $Z$. Given a $\delta > 0$ and a probability measure $P$, the variational
ball of radius $\delta$ around $P$ is the set of all probability
measures $Q$ with $d(P, Q) \le \delta$.

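Given a source $X$ with process distribution $P$, let $P_{-\infty}^0$ and
$P_k^\infty$ denote the marginal distributions of $P$ on $\{X_j\}_{j \le 0}$ and
$\{X_j\}_{j \ge k}$, respectively. For each $k \ge 1$, the $k$th-order absolute
regularity coefficient (or $\beta$-mixing coefficient) of $P$ is defined
as [16], [17]:

$$\beta_P(k) = \sup \sum_i \sum_j \left| P(A_i \cap B_j) - P_{-\infty}^0(A_i) P_k^\infty(B_j) \right|,$$

where the supremum is over all finite $\sigma(X_{-\infty}^0)$-measurable
partitions $\{A_i\}$ and all finite $\sigma(X_k^\infty)$-measurable partitions
$\{B_j\}$. Observe that

$$\beta_P(k) = d\left(P, P_{-\infty}^0 \times P_k^\infty; \sigma(X_{-\infty}^0, X_k^\infty)\right), \qquad (1)$$

the variational distance between $P$ and the product distribution
$P_{-\infty}^0 \times P_k^\infty$ with respect to the $\sigma$-algebra $\sigma(X_{-\infty}^0, X_k^\infty)$. Since
$X$ is stationary, we can "split" its process distribution at any
point $l \in \mathbb{Z}$ and define $\beta_P(k)$ equivalently by

$$\beta_P(k) = d\left(P, P_{-\infty}^l \times P_{l+k}^\infty; \sigma(X_{-\infty}^l, X_{l+k}^\infty)\right). \qquad (2)$$

Again, if $P$ is subscripted by some $\theta$, $P = P_\theta$, then we write
$\beta_\theta(k)$.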

B. Codes 

The class of codes we consider here is the collection of
all finite-memory variable-rate vector quantizers. Let $\hat{\mathcal{X}}$ be a
reproduction alphabet, also assumed to be Polish. We assume
that $\mathcal{X} \cup \hat{\mathcal{X}}$ is a subset of a Polish metric space $\mathcal{Y}$ with a
bounded metric $\rho_0(\cdot, \cdot)$: there exists some $\rho_{\max} < +\infty$, such
that $\rho_0(y, y') \le \rho_{\max}$ for all $y, y' \in \mathcal{Y}$. We take $\rho : \mathcal{X} \times \hat{\mathcal{X}} \to
[0, \rho_{\max}]$, $\rho(x, \hat{x}) = \rho_0(x, \hat{x})$, as our (single-letter) distortion
function. A variable-rate vector quantizer with block length
$n$ and memory length $m$ is a pair $C^{n,m} = (f, \varphi)$, where
$f : \mathcal{X}^n \times \mathcal{X}^m \to \mathcal{S}$ is the encoder, $\varphi : \mathcal{S} \to \hat{\mathcal{X}}^n$ is the
decoder, and $\mathcal{S} \subseteq \{0, 1\}^*$ is a countable collection of binary
strings satisfying the prefix condition or, equivalently, the Kraft
inequality

$$\sum_{s \in \mathcal{S}} 2^{-\ell(s)} \le 1,$$



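where $\ell(s)$ denotes the length of $s$ in bits. The mapping of
the source $X$ into the reproduction process $\hat{X}$ is defined by

$$\hat{X}_{ni+1}^{n(i+1)} = \varphi\left(f\left(X_{ni+1}^{n(i+1)}, X_{ni-m+1}^{ni}\right)\right), \qquad i \in \mathbb{Z}.$$

That is, the encoding is done in blocks of length $n$, but the
encoder is also allowed to observe the $m$ symbols immediately
preceding each block. The effective memory of $C^{n,m}$ is defined
as the set $\mathcal{M} \subseteq \{1, \ldots, m\}$, such that

$$f(x^n, z^m) = f(x^n, \tilde{z}^m), \qquad \forall x^n \in \mathcal{X}^n, \ \forall z^m, \tilde{z}^m \in \mathcal{X}^m : z_i = \tilde{z}_i, \ \forall i \in \mathcal{M}.$$

The size $|\mathcal{M}|$ of $\mathcal{M}$ is called the effective memory length of
$C^{n,m}$. We shall often use $C^{n,m}$ to also denote the composite
mapping $\varphi \circ f$: $\hat{X}_1^n = C^{n,m}(X_1^n, X_{-m+1}^0)$. When the code
has zero memory ($m = 0$), we shall denote it more compactly
by $C^n$.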

The performance of the code on the source with process 
distribution P is measured by its expected distortion 

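$$D_P(C^{n,m}) = E_P\left\{\rho_n(X_1^n, \hat{X}_1^n)\right\},$$

where for $x^n \in \mathcal{X}^n$ and $\hat{x}^n \in \hat{\mathcal{X}}^n$, $\rho_n(x^n, \hat{x}^n) =
n^{-1} \sum_{i=1}^n \rho(x_i, \hat{x}_i)$ is the per-letter distortion incurred in
reproducing $x^n$ by $\hat{x}^n$, and by its expected rate

$$R_P(C^{n,m}) \triangleq E_P\left\{\ell_n\left(f(X_1^n, X_{-m+1}^0)\right)\right\},$$

where $\ell_n(s)$ denotes the length of a binary string $s$ in bits,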
normalized by n. (We follow Neuhoff and Gilbert [18] and 
normalize the distortion and the rate by the length n of the 
reproduction block, not by the combined length n + m of 
the source block plus the memory input.) When working with 
variable-rate quantizers, it is convenient [13], [19] to absorb 
the distortion and the rate into a single performance measure, 
the Lagrangian distortion 

$$L_P(C^{n,m}, \lambda) = D_P(C^{n,m}) + \lambda R_P(C^{n,m}),$$

where $\lambda > 0$ is the Lagrange multiplier which controls the
distortion-rate trade-off. Geometrically, $L_P(C^{n,m}, \lambda)$ is the
$y$-intercept of the line with slope $-\lambda$, passing through the point
$(R_P(C^{n,m}), D_P(C^{n,m}))$ in the rate-distortion plane [20]. If
$P$ carries a subscript, $P = P_\theta$, then we write $D_\theta(\cdot)$, $R_\theta(\cdot)$
and $L_\theta(\cdot)$.
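To make the three performance measures concrete, the following minimal sketch evaluates the expected distortion, the expected rate and the Lagrangian of a toy zero-memory code on a finite alphabet. The i.i.d. Bernoulli(0.1) source, the block length $n = 2$, the Hamming distortion and the particular prefix code are illustrative assumptions and do not come from the paper.

```python
from itertools import product

def lagrangian_performance(pmf, encoder, decoder, rho, lam):
    """Return (D, R, L) for a zero-memory block code on a finite alphabet.

    pmf     : dict mapping source blocks (tuples) to probabilities
    encoder : dict mapping source blocks to binary strings (prefix-free set)
    decoder : dict mapping binary strings to reproduction blocks
    rho     : single-letter distortion function
    lam     : Lagrange multiplier (lambda > 0)
    """
    D = R = 0.0
    for x, p in pmf.items():
        n = len(x)
        s = encoder[x]
        x_hat = decoder[s]
        # per-letter distortion rho_n and per-letter length ell_n
        D += p * sum(rho(a, b) for a, b in zip(x, x_hat)) / n
        R += p * len(s) / n
    return D, R, D + lam * R

def hamming(a, b):
    return 0.0 if a == b else 1.0

# Hypothetical toy source: pairs of i.i.d. Bernoulli(q) symbols.
q = 0.1
pmf = {x: (q if x[0] else 1 - q) * (q if x[1] else 1 - q)
       for x in product((0, 1), repeat=2)}

# Hypothetical code: the rare block (1, 1) shares the codeword of (0, 0).
encoder = {(0, 0): "0", (0, 1): "10", (1, 0): "110", (1, 1): "0"}
decoder = {"0": (0, 0), "10": (0, 1), "110": (1, 0)}

D, R, L = lagrangian_performance(pmf, encoder, decoder, hamming, lam=0.5)
```

In this toy code the only distortion comes from the block $(1,1)$, so lowering $\lambda$ would eventually favor giving that block its own (longer) codeword.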

C. Vapnik-Chervonenkis classes 

In this paper, we make heavy use of Vapnik-Chervonenkis 
theory (see Devroye, Gyorfi and Lugosi [21], Vapnik [22], 
Devroye and Lugosi [23] or Vidyasagar [24] for detailed treat- 
ments). This section contains a brief summary of the needed 
concepts and results. Let $(Z, \mathcal{A})$ be a measurable space. For
any collection $\mathcal{C} \subseteq \mathcal{A}$ of measurable subsets of $Z$ and any
$n$-tuple $z^n \in Z^n$, define the set $\mathcal{C}(z^n) \subseteq \{0, 1\}^n$ consisting of
all distinct binary strings of the form $(1_{\{z_1 \in A\}}, \ldots, 1_{\{z_n \in A\}})$,
$A \in \mathcal{C}$. Then

$$S_n(\mathcal{C}) = \max_{z^n \in Z^n} |\mathcal{C}(z^n)|$$

is called the $n$th shatter coefficient of $\mathcal{C}$. The Vapnik-Chervonenkis
dimension (or VC-dimension) of $\mathcal{C}$, denoted by
$V(\mathcal{C})$, is defined as the largest $n$ for which $S_n(\mathcal{C}) = 2^n$ (if
$S_n(\mathcal{C}) = 2^n$ for all $n = 1, 2, \ldots$, then we set $V(\mathcal{C}) = \infty$). If
$V(\mathcal{C}) < \infty$, then $\mathcal{C}$ is called a Vapnik-Chervonenkis class (or
VC class). If $\mathcal{C}$ is a VC class with $V(\mathcal{C}) \ge 2$, then it follows
from the results of Vapnik and Chervonenkis [12] and Sauer
[25] that $S_n(\mathcal{C}) \le n^{V(\mathcal{C})}$.
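The definitions of the shatter coefficient and the VC dimension can be checked by brute force on a small example. The sketch below does this for the class of closed intervals $[a, b]$ on a finite integer grid; the grid, the interval class and all function names are assumptions made for illustration, and the maximum is taken only over points of the grid rather than over the whole real line.

```python
from itertools import combinations

def patterns(sets, points):
    """Distinct binary strings (1{z_1 in A}, ..., 1{z_n in A}), A in the class."""
    return {tuple(int(A(z)) for z in points) for A in sets}

def shatter_coefficient(sets, ground, n):
    """S_n restricted to n-tuples of distinct points from `ground`."""
    return max(len(patterns(sets, pts)) for pts in combinations(ground, n))

def vc_dimension(sets, ground, n_max):
    """Largest n <= n_max with S_n = 2^n (0 if there is none)."""
    v = 0
    for n in range(1, n_max + 1):
        if shatter_coefficient(sets, ground, n) == 2 ** n:
            v = n
    return v

# Hypothetical class: intervals [a, b] with endpoints on a grid
# (a > b gives the empty set, which yields the all-zero pattern).
grid = range(6)
intervals = [lambda z, a=a, b=b: a <= z <= b for a in grid for b in grid]

S = [shatter_coefficient(intervals, grid, n) for n in range(1, 5)]
V = vc_dimension(intervals, grid, 4)
```

For intervals, $n$ distinct points admit $n(n+1)/2 + 1$ patterns, so three points can no longer be shattered and the bound $S_n(\mathcal{C}) \le n^{V(\mathcal{C})}$ holds with $V(\mathcal{C}) = 2$.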

For a VC class $\mathcal{C}$, the so-called Vapnik-Chervonenkis
inequalities (see Lemma 2.1 below) relate its VC dimension
$V(\mathcal{C})$ to maximal deviations of the probabilities of the events
in $\mathcal{C}$ from their relative frequencies with respect to an i.i.d.
sample of size $n$. For any $z^n \in Z^n$, let

$$P_{z^n} = \frac{1}{n} \sum_{i=1}^n \delta_{z_i}$$

denote the induced empirical distribution, where $\delta_{z_i}$ is the
Dirac measure (point mass) concentrated at $z_i$. We then have
the following:

Lemma 2.1 (Vapnik-Chervonenkis inequalities): Let $P$ be
a probability measure on $(Z, \mathcal{A})$, and $Z^n = (Z_1, \ldots, Z_n)$
an $n$-tuple of independent random variables with $Z_i \sim P$,
$1 \le i \le n$. Let $\mathcal{C}$ be a Vapnik-Chervonenkis class with
$V(\mathcal{C}) \ge 2$. Then for every $\delta > 0$,

$$\mathbb{P}\left\{ \sup_{A \in \mathcal{C}} |P_{Z^n}(A) - P(A)| > \delta \right\} \le 8 n^{V(\mathcal{C})} e^{-n\delta^2/32} \qquad (3)$$

and

$$E\left\{ \sup_{A \in \mathcal{C}} |P_{Z^n}(A) - P(A)| \right\} \le c \sqrt{\frac{V(\mathcal{C}) \log n}{n}}, \qquad (4)$$

where $c > 0$ is a universal constant. The probabilities and
expectations are with respect to the product measure $P^n$ on
$(Z^n, \mathcal{A}^n)$.
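To get a feel for the two bounds of Lemma 2.1, the following sketch evaluates the right-hand side of (3) and the rate in (4) numerically. The function names are hypothetical, the universal constant $c$ in (4) is left out because the lemma does not specify it, the natural logarithm is assumed, and the tail bound is capped at one since it is a bound on a probability.

```python
import math

def vc_tail_bound(n, V, delta):
    """Right-hand side of (3), 8 n^V exp(-n delta^2 / 32), capped at 1."""
    log_bound = math.log(8) + V * math.log(n) - n * delta ** 2 / 32
    return 1.0 if log_bound >= 0 else math.exp(log_bound)

def vc_rate(n, V):
    """The sqrt(V log n / n) rate in (4), without the constant c."""
    return math.sqrt(V * math.log(n) / n)

# The bound is vacuous for small samples and decays rapidly afterwards.
small = vc_tail_bound(100, V=2, delta=0.1)
large = vc_tail_bound(10 ** 6, V=2, delta=0.1)
```

The polynomial factor $n^{V(\mathcal{C})}$ is what makes the bound trivial until $n\delta^2$ is large compared with $V(\mathcal{C}) \log n$.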

Remark 2.1: A more refined technique involving metric 
entropies and empirical covering numbers, due to Dudley [26], 
can yield a much better $O(1/\sqrt{n})$ bound on the expected maximal
deviation between the true and the empirical probabilities.
This improvement, however, comes at the expense of a much
larger constant hidden in the $O(\cdot)$ notation.

Finally, we shall need the following lemma, which is a 
simple corollary of the results of Karpinski and Macintyre 
[27] (see also Section 10.3.5 of Vidyasagar [24]): 

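Lemma 2.2: Let $\mathcal{C} = \{A_\xi : \xi \in \mathbb{R}^k\}$ be a collection of
measurable subsets of $\mathbb{R}^d$, such that

$$A_\xi = \{z \in \mathbb{R}^d : \Pi(z, \xi) > 0\}, \qquad \xi \in \mathbb{R}^k,$$

where for each $z \in \mathbb{R}^d$, $\Pi(z, \cdot)$ is a polynomial of degree $s$
in the components of $\xi$. Then $\mathcal{C}$ is a VC class with $V(\mathcal{C}) \le
2k \log(4es)$.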

III. Statement of results 

In this section we state our result concerning universal 
schemes for joint lossy compression and identification of 
stationary sources under certain regularity conditions. We work 
in the usual setting of universal source coding: we are given 
a source $X = \{X_i\}_{i \in \mathbb{Z}}$ whose process distribution is known
to be a member of some parametric class $\{P_\theta : \theta \in \Lambda\}$. The
parameter space $\Lambda$ is an open subset of the Euclidean space
$\mathbb{R}^k$ for some finite $k$, and we assume that $\Lambda$ has nonempty
interior. We wish to design a sequence of variable-rate vector
quantizers, such that the decoder can reliably reconstruct the
original source sequence $X$ and reliably identify the active
source in an asymptotically optimal manner for all $\theta \in \Lambda$. We
begin by listing the regularity conditions. 

Condition 1. The sources in $\{P_\theta : \theta \in \Lambda\}$ are algebraically
$\beta$-mixing: there exists a constant $r > 0$, such that

$$\beta_\theta(k) = O(k^{-r}), \qquad \forall \theta \in \Lambda,$$

where the constant implicit in the $O(\cdot)$ notation may depend
on $\theta$.



This condition ensures that certain finite-block functions of
the source $X$ can be approximated in distribution by i.i.d.
processes, so that we can invoke the Vapnik-Chervonenkis
machinery of Section II-C. This "blocking" technique, which
we exploit in the proof of our Theorem 3.1, dates back to
Bernstein [28], and was used by Yu [29] to derive rates
of convergence in the uniform laws of large numbers for
stationary mixing processes, and by Meir [30] in the context of
nonparametric adaptive prediction of stationary time series. As
an example of when an even stronger decay condition holds,
let $X = \{X_i\}_{i \in \mathbb{Z}}$ be a finite-order autoregressive moving-average
(ARMA) process driven by a zero-mean i.i.d. process
$Y = \{Y_i\}$, i.e., there exist positive integers $p$, $q$ and $p + q + 1$
real constants $a_0, a_1, \ldots, a_p, b_1, \ldots, b_q$ such that

$$\sum_{i=0}^{p} a_i X_{n-i} = Y_n + \sum_{j=1}^{q} b_j Y_{n-j}, \qquad n \in \mathbb{Z}.$$

Mokkadem [31] has shown that, provided the common distribution
of the $Y_i$ is absolutely continuous and the roots of the
polynomial $A(z) = \sum_{i=0}^{p} a_i z^i$ lie outside the unit circle in
the complex plane, the $\beta$-mixing coefficients of $X$ decay to
zero exponentially.

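Condition 2. For each $\theta \in \Lambda$, there exist constants $\delta_\theta > 0$
and $c_\theta > 0$, such that

$$\sup_{n \ge 1} \frac{d_n(\theta, \theta')}{\sqrt{n}} \le c_\theta \|\theta - \theta'\|$$

for all $\theta'$ in the open ball of radius $\delta_\theta$ centered at $\theta$, where
$\|\cdot\|$ is the Euclidean norm on $\Lambda$.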

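This condition guarantees that, for any sequence $\{\delta_n\}_{n \in \mathbb{N}}$
of positive reals such that

$$\delta_n \to 0, \quad \sqrt{n}\,\delta_n \to 0, \qquad \text{as } n \to \infty$$

and any sequence $\{\theta_n\}_{n \in \mathbb{N}}$ in $\Lambda$ satisfying $\|\theta_n - \theta\| < \delta_n$ for
a given $\theta \in \Lambda$, we have

$$d_n(\theta, \theta_n) \to 0, \qquad \text{as } n \to \infty.$$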

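It is weaker (i.e., more general) than the conditions of Rissanen
[2], [3] which control the behavior of the relative entropy
(information divergence) as a function of the source parameters in
terms of the Fisher information and related quantities. Indeed,
for each $n$ let

$$D_n(P_\theta \| P_{\theta'}) = \frac{1}{n} E_\theta\left\{ \ln \frac{dP_\theta}{dP_{\theta'}}(X_1^n) \right\} = \frac{1}{n} \int_{\mathcal{X}^n} p_\theta(x^n) \ln \frac{p_\theta(x^n)}{p_{\theta'}(x^n)} \, d\mu(x^n)$$

be the normalized $n$th-order relative entropy (information
divergence) between $P_\theta$ and $P_{\theta'}$. Suppose that, for each $\theta$,
$D_n(P_\theta \| P_{\theta'})$ is twice continuously differentiable as a function
of $\theta'$. Let $\theta'$ lie in an open ball of radius $\delta$ around $\theta$. Since
$D_n(P_\theta \| P_{\theta'})$ attains its minimum at $\theta' = \theta$, the gradient
$\nabla_{\theta'} D_n(P_\theta \| P_{\theta'})$ evaluated at $\theta' = \theta$ is zero, and we can write
the second-order Taylor expansion of $D_n$ about $\theta$ as

$$D_n(P_\theta \| P_{\theta'}) = \frac{1}{2} (\theta - \theta')^T J_n(\theta) (\theta - \theta') + o(\|\theta - \theta'\|^2), \qquad (5)$$

where the Hessian matrix

$$J_n(\theta) = \left[ \frac{\partial^2}{\partial \theta'_i \, \partial \theta'_j} D_n(P_\theta \| P_{\theta'}) \Big|_{\theta' = \theta} \right]_{i,j = 1, \ldots, k},$$

under additional regularity conditions, is equal to the Fisher
information matrix

$$I_n(\theta) = -\frac{1}{n} \left[ E_\theta\left\{ \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln p_\theta(X_1^n) \right\} \right]_{i,j = 1, \ldots, k}$$

(see Clarke and Barron [32]). Assume now, following Rissanen
[2], [3], that the sequence of matrix norms $\{\|I_n(\theta)\|\}$ is
bounded (by a constant depending on $\theta$). Then we can write

$$D_n(P_\theta \| P_{\theta'}) \le \frac{1}{2} \left( \|I_n(\theta)\| + o(1) \right) \|\theta - \theta'\|^2 \le \frac{c_\theta^2}{2} \|\theta - \theta'\|^2,$$

i.e., the normalized relative entropies $D_n(P_\theta \| P_{\theta'})$ are locally
quadratic in $\theta'$. Then Pinsker's inequality (see, e.g.,
Lemma 5.2.8 of Gray [15]) implies that $\sqrt{2 D_n(P_\theta \| P_{\theta'})} \ge
d_n(\theta, \theta')/\sqrt{n}$, and we recover our Condition 2. Rissanen's
condition, while stronger than our Condition 2, is easier to
check, a fact which we exploit in our discussion of examples
in Section V.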
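The chain of inequalities above can be verified numerically in the simplest case. The sketch below assumes an i.i.d. unit-variance Gaussian location family $N(\theta, 1)$, chosen here for illustration because both quantities have closed forms: the normalized relative entropy is $(\theta - \theta')^2/2$ nats, and the variational ($L_1$) distance between the $n$-dimensional marginals is $2\,\mathrm{erf}\big(\sqrt{n}\,|\theta - \theta'| / (2\sqrt{2})\big)$. The function names are hypothetical.

```python
import math

def d_n_gaussian(theta, theta_p, n):
    """Variational (L1) distance between N(theta,1)^n and N(theta_p,1)^n."""
    delta = math.sqrt(n) * abs(theta - theta_p)   # distance between mean vectors
    return 2.0 * math.erf(delta / (2.0 * math.sqrt(2.0)))

def D_n_gaussian(theta, theta_p):
    """Normalized relative entropy (nats per symbol) for the same pair."""
    return 0.5 * (theta - theta_p) ** 2

# Check Pinsker's inequality in the normalized form used in the text,
# sqrt(2 D_n) >= d_n / sqrt(n), which gives Condition 2 with c_theta = 1.
checks = []
for n in (1, 10, 100, 10_000):
    for gap in (1e-3, 0.05, 0.5, 2.0):
        lhs = d_n_gaussian(0.0, gap, n) / math.sqrt(n)
        rhs = math.sqrt(2.0 * D_n_gaussian(0.0, gap))
        checks.append(lhs <= rhs)
```

In this family the ratio $d_n/\sqrt{n}$ is largest for small gaps, where it behaves like $\sqrt{2/\pi}\,|\theta - \theta'|$, and the bound becomes loose once the marginals are nearly mutually singular.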

Condition 3. For each $n$, let $\mathcal{A}_n$ be the collection of all sets
of the form

$$A_{\theta, \theta'} = \{x^n \in \mathcal{X}^n : p_\theta(x^n) > p_{\theta'}(x^n)\}, \qquad \theta, \theta' \in \Lambda.$$

Then we require that, for each $n$, $\mathcal{A}_n$ is a VC class, and
$V(\mathcal{A}_n) = o(n/\log n)$.

This condition is satisfied, for example, when $V(\mathcal{A}_n) =
V < \infty$ independently of $n$, or when $V(\mathcal{A}_n) = \log n$. The
use of the class $\mathcal{A}_n$ dates back to the work of Yatracos
[33] on minimum-distance density estimation. The ideas of
Yatracos were further developed by Devroye and Lugosi [34],
[35], who dubbed $\mathcal{A}_n$ the Yatracos class (associated with
the densities $p_\theta$). We shall adhere to this terminology. To
give an intuitive interpretation to $\mathcal{A}_n$, let us consider a pair
$\theta, \theta' \in \Lambda$ of distinct parameter vectors and note that the set
$\{x^n : p_\theta(x^n) > p_{\theta'}(x^n)\}$ consists of all $x^n$ for which the
simple hypothesis test

$$H_0 : X_1^n \sim P_\theta^n \quad \text{versus} \quad H_1 : X_1^n \sim P_{\theta'}^n \qquad (6)$$

is passed by the null hypothesis $H_0$ under the likelihood-ratio
decision rule. Now, suppose that $Z_1, \ldots, Z_m$ are drawn
independently from $P_\theta^n$. To each $A \in \mathcal{A}_n$ we can associate a
classifier $\kappa_A : \mathcal{X}^n \to \{0, 1\}$ defined by $\kappa_A(x^n) = 1_{\{x^n \in A\}}$.
Call two sets $A, A' \in \mathcal{A}_n$ equivalent with respect to the sample
$Z^m = (Z_1, \ldots, Z_m)$, and write $A \sim_{Z^m} A'$, if their associated
classifiers yield identical classification patterns:

$$(\kappa_A(Z_1), \ldots, \kappa_A(Z_m)) = (\kappa_{A'}(Z_1), \ldots, \kappa_{A'}(Z_m)).$$

It is easy to see that $\sim_{Z^m}$ is an equivalence relation. From
the definitions of the shatter coefficients $S_m(\mathcal{A}_n)$ and the VC
dimension $V(\mathcal{A}_n)$ (cf. Section II-C), we see that the cardinality
of the quotient set $\mathcal{A}_n / \sim_{Z^m}$ is equal to $2^m$ for all sample
sizes $m \le V(\mathcal{A}_n)$, whereas for $m > V(\mathcal{A}_n)$, it is bounded
from above by $m^{V(\mathcal{A}_n)}$, which is strictly less than $2^m$. Thus,
the fact that the Yatracos class $\mathcal{A}_n$ has finite VC dimension
implies that the problem of estimating the density $p_\theta$ from a
large i.i.d. sample reduces, in a sense, to a finite number (in
fact, polynomial in the sample size $m$, for $m > V(\mathcal{A}_n)$) of
simple hypothesis tests of the type (6). Our Condition 1 will
then allow us to transfer this intuition to (weakly) dependent
samples.
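The counting argument can be illustrated on a one-dimensional example. For the unit-variance Gaussian location family $N(\theta, 1)$ with $n = 1$ (an assumption made here for illustration), $p_\theta(x) > p_{\theta'}(x)$ holds exactly when $x$ lies on the $\theta$ side of the midpoint $(\theta + \theta')/2$, so the Yatracos class consists of all open half-lines. The sketch below enumerates the classification patterns that these half-lines induce on a sample; the function name and the sample values are hypothetical.

```python
def yatracos_patterns(sample):
    """Classification patterns induced on `sample` by the half-lines
    {x < c} and {x > c}, i.e. by the Yatracos sets of the N(theta, 1) family."""
    pts = sorted(set(sample))
    # One representative threshold per gap between consecutive sample points,
    # plus one below and one above the whole sample.
    cuts = ([pts[0] - 1.0]
            + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
            + [pts[-1] + 1.0])
    pats = set()
    for c in cuts:
        pats.add(tuple(int(z < c) for z in sample))   # theta < theta'
        pats.add(tuple(int(z > c) for z in sample))   # theta > theta'
    return pats

sample = [0.3, -1.2, 2.5, 0.9, -0.4]
pats = yatracos_patterns(sample)
m = len(sample)
```

With $m = 5$ distinct sample points the half-lines produce only $2m = 10$ distinct patterns, far fewer than the $2^m = 32$ conceivable ones, which is the sense in which density estimation over this family reduces to a small number of simple hypothesis tests.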

Now that we have listed the regularity conditions that must
hold for the sources in $\{P_\theta : \theta \in \Lambda\}$, we can state our main
result.

Theorem 3.1: Let $\{P_\theta : \theta \in \Lambda\}$ be a parametric class of
sources satisfying Conditions 1-3. Then for every $\lambda > 0$
and every $\eta > 0$, there exists a sequence $\{C_*^{n,m_n}\}_{n \in \mathbb{N}}$ of
variable-rate vector quantizers with memory length $m_n \le
n(n + n^{(2+\eta)/r} + 1)$ and effective memory length $n^2$, such
that, for all $\theta \in \Lambda$,

$$L_\theta(C_*^{n,m_n}, \lambda) - \inf_{m \ge 0} \inf_{C^{n,m}} L_\theta(C^{n,m}, \lambda) = O\left( \sqrt{\frac{V(\mathcal{A}_n) \log n}{n}} \right), \qquad (7)$$

where the constants implicit in the $O(\cdot)$ notation depend on
$\theta$. Furthermore, for each $n$, the binary description produced
by the encoder is such that the decoder can identify the
$n$-dimensional marginal of the active source up to a variational
ball of radius $O\left(\sqrt{V(\mathcal{A}_n) \log n / n}\right)$ with probability one.

What (7) says is that, for each block length $n$ and each
$\theta \in \Lambda$, the code $C_*^{n,m_n}$, which is independent of $\theta$, performs
almost as well as the best finite-memory quantizer with block
length $n$ that can be designed with full a priori knowledge of
the $n$-dimensional marginal $P_\theta^n$. Thus, as far as compression
goes, our scheme can compete with all finite-memory variable-rate
lossy block codes (vector quantizers), with the additional
bonus of allowing the decoder to identify the active source in
an asymptotically optimal manner.

It is not hard to see that the double infimum in (7) is
achieved already in the zero-memory case, $m = 0$. Indeed, it
is immediate that having nonzero memory can only improve
the Lagrangian performance, i.e.,

$$\inf_{m \ge 0} \inf_{C^{n,m}} L_\theta(C^{n,m}, \lambda) \le \inf_{C^n} L_\theta(C^n, \lambda), \qquad \forall \theta \in \Lambda.$$
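On the other hand, given any code $C^{n,m} = (f, \varphi)$, we can
construct a zero-memory code $C_0^n = (f_0, \varphi_0)$, such that
$L_\theta(C_0^n, \lambda) \le L_\theta(C^{n,m}, \lambda)$ for all $\theta \in \Lambda$. To see this, define
for each $x^n \in \mathcal{X}^n$ the set

$$\mathcal{S}(x^n) = \{s \in \{0, 1\}^* : s = f(x^n, z^m) \text{ for some } z^m \in \mathcal{X}^m\},$$

and let

$$f_0(x^n) = \arg\min_{s \in \mathcal{S}(x^n)} \left( \rho_n(x^n, \varphi(s)) + \lambda \ell_n(s) \right), \qquad \forall x^n \in \mathcal{X}^n$$

and $\varphi_0 = \varphi$. Then, given any $(x^n, z^m) \in \mathcal{X}^n \times \mathcal{X}^m$, let
$s = f(x^n, z^m)$. We then have

$$\begin{aligned}
\rho_n(x^n, \varphi_0(f_0(x^n))) + \lambda \ell_n(f_0(x^n))
&= \rho_n(x^n, \varphi(f_0(x^n))) + \lambda \ell_n(f_0(x^n)) \\
&\le \rho_n(x^n, \varphi(s)) + \lambda \ell_n(s) \\
&= \rho_n(x^n, \varphi(f(x^n, z^m))) + \lambda \ell_n(f(x^n, z^m)).
\end{aligned}$$

Taking expectations, we see that $L_\theta(C_0^n, \lambda) \le L_\theta(C^{n,m}, \lambda)$
for all $\theta \in \Lambda$, which proves that

$$\inf_{C^n} L_\theta(C^n, \lambda) \le \inf_{m \ge 0} \inf_{C^{n,m}} L_\theta(C^{n,m}, \lambda), \qquad \forall \theta \in \Lambda.$$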
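The construction of the zero-memory encoder $f_0$ can be sketched on a toy finite code. In the example below the binary alphabet, the block and memory lengths $n = m = 1$, the Hamming distortion and the encoder table are all assumptions made for illustration; the helper names are hypothetical.

```python
def cost(x, s, phi, rho_n, lam, n):
    """Per-block Lagrangian cost rho_n(x, phi(s)) + lam * ell_n(s)."""
    return rho_n(x, phi[s]) + lam * len(s) / n

def strip_memory(f, phi, rho_n, lam, n, x_blocks, z_blocks):
    """Build f0(x) = argmin over s in S(x) of the Lagrangian cost, where
    S(x) = { f(x, z) : z a memory block } collects the strings used for x."""
    f0 = {}
    for x in x_blocks:
        S_x = {f[(x, z)] for z in z_blocks}
        # ties are broken lexicographically so that the result is deterministic
        f0[x] = min(S_x, key=lambda s: (cost(x, s, phi, rho_n, lam, n), s))
    return f0

def hamming_n(x, x_hat):
    return sum(a != b for a, b in zip(x, x_hat)) / len(x)

# Hypothetical code with memory: the string used for x depends on z.
blocks = [(0,), (1,)]
f = {((0,), (0,)): "0", ((0,), (1,)): "10",
     ((1,), (0,)): "0", ((1,), (1,)): "11"}
phi = {"0": (0,), "10": (0,), "11": (1,)}
lam, n = 0.3, 1

f0 = strip_memory(f, phi, hamming_n, lam, n, blocks, blocks)
```

Because $f_0$ picks the cheapest string among those the original encoder could ever emit for a given block, its cost is no larger than that of $f$ for every memory content, and the inequality between expected Lagrangians follows by averaging.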

The infimum of $L_\theta(C^n, \lambda)$ over all zero-memory variable-rate
quantizers $C^n$ with block length $n$ is the operational $n$th-order
distortion-rate Lagrangian $L_\theta^n(\lambda)$ [20]. Because each $P_\theta$ is
ergodic, $L_\theta^n(\lambda)$ converges to the distortion-rate Lagrangian

$$L_\theta(\lambda) = \min_R \left( D_\theta(R) + \lambda R \right),$$

where $D_\theta(R)$ is the Shannon distortion-rate function of $P_\theta$
(see Lemma 2 in the Appendix to Chou, Effros and Gray
[14]). Thus, our scheme is universal not only in the $n$th-order
sense of (7), but also in the distortion-rate sense, i.e.,

$$L_\theta(C_*^{n,m_n}, \lambda) - L_\theta(\lambda) \to 0, \qquad \text{as } n \to \infty$$

for every $\theta \in \Lambda$. Thus, in the terminology of [14], our scheme
is weakly minimax universal for $\{P_\theta : \theta \in \Lambda\}$.

IV. Proof of Theorem 3.1

A. The main idea 

In this section, we describe the main idea behind the proof 
and fix some notation. We have already seen that it suffices to 
construct a universal scheme that can compete with all zero- 
memory variable-rate quantizers. That is, it suffices to show 
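that there exists a sequence $\{C_*^{n,m_n}\}$ of codes, such that

$$L_\theta(C_*^{n,m_n}, \lambda) - L_\theta^n(\lambda) = O\left( \sqrt{\frac{V(\mathcal{A}_n) \log n}{n}} \right), \qquad \forall \theta \in \Lambda. \qquad (8)$$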

This is what we shall prove. 

We assume throughout that the "true" source is $P_{\theta_0}$ for some
$\theta_0 \in \Lambda$. Our code operates as follows. Suppose that:

• Both the encoder and the decoder have access to a
countably infinite "database" $\mathcal{C} = \{\theta(i)\}_{i \in \mathbb{N}}$, where each
$\theta(i) \in \Lambda$. Using Elias' universal representation of the
integers [36], we can associate to each $\theta(i)$ a unique
binary string $s(i)$ with $\ell(s(i)) = \log i + O(\log \log i)$ bits.

• A sequence $\{\delta_n\}$ of positive reals is given, such that

$$\delta_n \to 0, \quad \sqrt{n}\,\delta_n \to 0, \qquad \text{as } n \to \infty$$

(we shall specify the sequence $\{\delta_n\}$ later in the proof).

• For each $n \in \mathbb{N}$ and each $\theta \in \Lambda$, there exists a zero-memory
$n$-block code $C_\theta^n = (f_\theta, \varphi_\theta)$ that achieves
(or comes arbitrarily close to) the $n$th-order Lagrangian
optimum for $P_\theta$: $L_\theta(C_\theta^n, \lambda) = L_\theta^n(\lambda)$.
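The first assumption relies on a prefix-free binary representation of the positive integers whose length grows as $\log i + O(\log \log i)$. One code from Elias' family with this property is the delta code; the sketch below implements it as an illustration, without any claim that it is the specific representation intended in [36] for this proof. The function names are hypothetical.

```python
def elias_delta_encode(i):
    """Elias delta codeword of a positive integer i, as a string of bits."""
    if i < 1:
        raise ValueError("Elias delta coding is defined for integers >= 1")
    n = i.bit_length()            # number of bits of i
    L = n.bit_length()            # number of bits of n
    # (L - 1) zeros, then n in binary, then i without its leading 1
    return "0" * (L - 1) + format(n, "b") + format(i, "b")[1:]

def elias_delta_decode(bits):
    """Decode one codeword from the front of `bits`.

    Returns the decoded integer and the remaining, unread bits."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    n = int(bits[zeros:2 * zeros + 1], 2)
    end = 2 * zeros + n           # total codeword length
    i = int("1" + bits[2 * zeros + 1:end], 2)
    return i, bits[end:]

codeword = elias_delta_encode(10)
kraft_sum = sum(2.0 ** -len(elias_delta_encode(i)) for i in range(1, 5000))
```

The codeword length is $\lfloor \log_2 i \rfloor + 2 \lfloor \log_2(\lfloor \log_2 i \rfloor + 1) \rfloor + 1$ bits, and since the code is prefix-free the decoder can strip the index from the front of a longer description without any separator.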

Fix the block length $n$. Because the source is stationary, it
suffices to describe the mapping of $X_1^n$ into $\hat{X}_1^n$. The encoding
is done as follows:

1) The encoder estimates $P_{\theta_0}^n$ from the $m_n$-block $X_{-m_n+1}^0$
as $P_{\tilde{\theta}}^n$, where $\tilde{\theta} = \tilde{\theta}(X_{-m_n+1}^0)$.

2) The encoder then computes the waiting time

$$T_n = \inf\left\{ i \ge 1 : d_n\left(\theta(i), \tilde{\theta}(X_{-m_n+1}^0)\right) \le \sqrt{n}\,\delta_n \right\},$$

with the standard convention that the infimum of the
empty set is equal to $+\infty$. That is, the encoder looks
through the database $\mathcal{C}$ and finds the first $\theta(i)$, such that
the $n$-dimensional distribution $P_{\theta(i)}^n$ is in the variational
ball of radius $\sqrt{n}\,\delta_n$ around $P_{\tilde{\theta}}^n$.

3) If $T_n < +\infty$, the encoder sets $\hat{\theta} = \theta(T_n)$; otherwise, the
encoder sets $\hat{\theta} = \theta_d$, where $\theta_d \in \Lambda$ is some default
parameter vector, say, $\theta(1)$.

4) The binary description of $X_1^n$ is a concatenation of the
following three binary strings: (i) a 1-bit flag $b$ to tell
whether $T_n$ is finite ($b = 0$) or infinite ($b = 1$); (ii) a
binary string $s_1$ which is equal to $s(T_n)$ if $T_n < +\infty$
or to an empty string if $T_n = +\infty$; (iii) $s_2 = f_{\hat{\theta}}(X_1^n)$.
The string $s = b s_1$ is called the first-stage description,
while $s_2$ is called the second-stage description.
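Steps 2) to 4) above can be sketched for the first-stage description as follows. Several ingredients are assumptions made only so that the example runs: the database is a short finite list rather than a countably infinite one, the sources are i.i.d. $N(\theta, 1)$ so that $d_n$ has a closed form, the index code $s(i)$ is the Elias delta code, and all function names are hypothetical. The parameter estimator and the second-stage codes are not modeled.

```python
import math

def elias_delta_encode(i):
    n = i.bit_length()
    return "0" * (n.bit_length() - 1) + format(n, "b") + format(i, "b")[1:]

def elias_delta_decode(bits):
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    n = int(bits[zeros:2 * zeros + 1], 2)
    end = 2 * zeros + n
    return int("1" + bits[2 * zeros + 1:end], 2), bits[end:]

def d_n(theta, theta_p, n):
    """Assumed closed form: L1 distance between N(theta,1)^n and N(theta_p,1)^n."""
    return 2.0 * math.erf(math.sqrt(n) * abs(theta - theta_p) / (2.0 * math.sqrt(2.0)))

def first_stage_encode(theta_tilde, database, n, delta_n):
    """Return the first-stage description s = b s_1 and the chosen parameter."""
    radius = math.sqrt(n) * delta_n
    for i, theta_i in enumerate(database, start=1):
        if d_n(theta_i, theta_tilde, n) <= radius:          # waiting time T_n = i
            return "0" + elias_delta_encode(i), theta_i     # flag b = 0
    return "1", database[0]                                 # T_n infinite: default theta(1)

def first_stage_decode(bits, database):
    """Recover the parameter from the front of `bits`; return it with the rest."""
    if bits[0] == "1":
        return database[0], bits[1:]
    i, rest = elias_delta_decode(bits[1:])
    return database[i - 1], rest

database = [0.0, 1.0, -1.0, 0.5, -0.5]      # hypothetical theta(1), ..., theta(5)
s, theta_hat = first_stage_encode(0.49, database, n=100, delta_n=0.01)
```

Here the fourth database entry is the first one within the variational ball, so the description is the flag `0` followed by the codeword of the index 4, and the decoder recovers the same parameter from the front of the received bit string before passing the remaining bits to the second-stage decoder.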

The decoder receives $b s_1 s_2$, determines $\hat{\theta}$ from $s$, and produces
the reproduction $\hat{X}_1^n = \varphi_{\hat{\theta}}(s_2)$. Note that when $b = 0$
(which, as we shall show, will happen eventually almost
surely), $P_{\hat{\theta}}^n$ lies in the variational ball of radius $\sqrt{n}\,\delta_n$ around
the estimated source $P_{\tilde{\theta}}^n$. If the latter is a good estimate
of $P_{\theta_0}^n$, i.e., $d_n(\theta_0, \tilde{\theta}) \to 0$ as $n \to \infty$ a.s., then the
estimate of the true source computed by the decoder is only
slightly worse. Furthermore, as we shall show, the almost-sure
convergence of $d_n(\theta_0, \hat{\theta})$ to zero as $n \to \infty$ implies that the
Lagrangian performance of $C_{\hat{\theta}}^n$ on $P_{\theta_0}$ is close to the optimum
$L_{\theta_0}(C_{\theta_0}^n, \lambda) = L_{\theta_0}^n(\lambda)$.

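Formally, the code $C_*^{n,m_n}$ comprises the following
maps:

• the parameter estimator $\tilde{\theta} : \mathcal{X}^{m_n} \to \Lambda$;

• the parameter encoder $\tilde{g} : \Lambda \to \tilde{\mathcal{S}}$, where $\tilde{\mathcal{S}} =
\{0s(i)\}_{i \in \mathbb{N}} \cup \{1\}$;

• the parameter decoder $\tilde{\psi} : \tilde{\mathcal{S}} \to \Lambda$.

Let $\tilde{f}$ denote the composition $\tilde{g} \circ \tilde{\theta}$ of the parameter estimator
and the parameter encoder, which we refer to as the first-stage
encoder, and let $\hat{\theta}$ denote the composition $\tilde{\psi} \circ \tilde{f}$ of the
parameter decoder and the first-stage encoder. The decoder $\tilde{\psi}$
is the first-stage decoder. The collection $\{C_\theta^n : \theta \in \Lambda\}$ defines
the second-stage codes. The encoder $f_* : \mathcal{X}^n \times \mathcal{X}^{m_n} \to \tilde{\mathcal{S}} \times \mathcal{S}$
and the decoder $\varphi_* : \tilde{\mathcal{S}} \times \mathcal{S} \to \hat{\mathcal{X}}^n$ of $C_*^{n,m_n}$ are defined as

$$f_*(X_1^n, X_{-m_n+1}^0) = \tilde{f}(X_{-m_n+1}^0) \, f_{\hat{\theta}(X_{-m_n+1}^0)}(X_1^n)$$

and

$$\varphi_*(\tilde{s} s) = \varphi_{\tilde{\psi}(\tilde{s})}(s), \qquad \tilde{s} \in \tilde{\mathcal{S}}, \ s \in \mathcal{S},$$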

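respectively. To assess the performance of $C^{n,m_n}_*$, consider
the function

$$g(X_1^n, X^0_{-m_n+1}) = \rho_n\big(X_1^n, \hat X^n\big) + \lambda\left[\ell_n\big(\tilde f(X^0_{-m_n+1})\big) + \ell_n\big(f_{\hat\theta(X^0_{-m_n+1})}(X_1^n)\big)\right],$$

where $\hat X^n = \varphi_*(f_*(X_1^n, X^0_{-m_n+1}))$ is the reproduction.
The expectation $E_{\theta_0}\{g(X_1^n, X^0_{-m_n+1})\}$ of $g$ with respect
to $P_{\theta_0}$ is precisely the Lagrangian performance of $C^{n,m_n}_*$,
at Lagrange multiplier $\lambda$, on the source $P_{\theta_0}$. We consider
separately the contributions of the first-stage and the second-stage
codes. Define another function $h : \mathcal{X}^n \times \mathcal{X}^{m_n} \to \mathbb{R}^+$
by

$$h(X_1^n, X^0_{-m_n+1}) = \rho_n\big(X_1^n, \varphi_{\hat\theta}(f_{\hat\theta}(X_1^n))\big) + \lambda\,\ell_n\big(f_{\hat\theta}(X_1^n)\big), \qquad \hat\theta = \hat\theta(X^0_{-m_n+1}),$$

so that $E_{\theta_0}\{h(X_1^n, X^0_{-m_n+1}) \,|\, X^0_{-m_n+1}\}$ is the (random) Lagrangian
performance of the code $C^n_{\hat\theta}$ on $P_{\theta_0}$. Hence,

$$g(X_1^n, X^0_{-m_n+1}) = h(X_1^n, X^0_{-m_n+1}) + \lambda\,\ell_n\big(\tilde f(X^0_{-m_n+1})\big),$$

so, taking expectations, we get

$$L_{\theta_0}(C^{n,m_n}_*, \lambda) = E_{\theta_0}\{h(X_1^n, X^0_{-m_n+1})\} + \lambda E_{\theta_0}\{\ell_n(\tilde f(X^0_{-m_n+1}))\}. \qquad (9)$$

Our goal is to show that the first term in Eq. (9) converges
to the $n$th-order optimum $\hat L^n_{\theta_0}(\lambda)$, and that the second term is
$o(1)$.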

The proof itself is organized as follows. First we motivate
the choice of the memory lengths $m_n$ in Section IV-B. Then
we indicate how to select the database $\mathcal{C}$ (Section IV-C) and
how to implement the parameter estimator $\tilde\theta$ (Section IV-D)
and the parameter encoder/decoder pair $(\tilde g, \psi)$ (Section IV-E).
The proof is concluded by estimating the Lagrangian performance
of the resulting code (Section IV-F) and the fidelity
of the source identification at the decoder (Section IV-G).
In the following, (in)equalities involving the relevant random
variables are assumed to hold for all realizations and not just
a.s., unless specified otherwise.



B. The memory length 

Let $l_n = \lceil n^{(2+\eta)/r} \rceil$, where $r$ is the common decay
exponent of the $\beta$-mixing coefficients $\beta_\theta(k)$ in Condition 1,
and let $m_n = n(n + l_n)$. We divide the $m_n$-block $X^0_{-m_n+1}$
into $n$ blocks $Z_1, \ldots, Z_n$ of length $n$ interleaved by $n$ blocks
$Y_1, \ldots, Y_n$ of length $l_n$ (see Figure 1). The parameter estimator
$\tilde\theta$, although defined as acting on the entire $X^0_{-m_n+1}$,
effectively will make use only of $Z^n = (Z_1, \ldots, Z_n)$. The
$Z_j$'s are each distributed according to $P^n_{\theta_0}$, but they are not
independent. Thus, the set

$$M = \bigcup_{j=1}^{n} \{ i : (j-1)(n + l_n) + 1 \le i \le j(n + l_n) - l_n \}$$

is the effective memory of $C^{n,m_n}_*$, and the effective memory
length is $n^2$.
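The block layout above is easy to tabulate. The following is a minimal sketch, assuming illustrative values for $n$, $r$ and $\eta$; the function name `block_layout` and the 1-based positions inside the $m_n$-block are conventions of this example only, not of the construction itself.

```python
import math

def block_layout(n, r, eta):
    """Return (l_n, m_n, z_blocks) for block length n.

    l_n = ceil(n^((2 + eta) / r)) is the gap length and m_n = n (n + l_n)
    is the total memory. z_blocks lists the (first, last) positions,
    1-based within the m_n-block, of the n estimation blocks Z_1..Z_n.
    """
    l_n = math.ceil(n ** ((2 + eta) / r))
    m_n = n * (n + l_n)
    z_blocks = [((j - 1) * (n + l_n) + 1, j * (n + l_n) - l_n)
                for j in range(1, n + 1)]
    return l_n, m_n, z_blocks

# Illustrative values only: n = 4, r = 2, eta = 0.
l_n, m_n, z_blocks = block_layout(n=4, r=2, eta=0)
# Total number of samples actually used by the estimator.
effective_memory = sum(last - first + 1 for first, last in z_blocks)
```

Each estimation block has length $n$, so the effective memory adds up to $n^2$ regardless of the gap length.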

Let $Q^{(n)}$ denote the marginal distribution of $Z^n$, and let
$\bar Q^{(n)}$ denote the product of $n$ copies of $P^n_{\theta_0}$. We now show that
we can approximate $Q^{(n)}$ by $\bar Q^{(n)}$ in variational distance,
increasingly finely with $n$. Note that both $Q^{(n)}$ and $\bar Q^{(n)}$ are defined
on the $\sigma$-algebra $\mathcal{F}^{(n)}$ generated by all $X_i$ except those
in $Y_1, \ldots, Y_n$, so that $d(Q^{(n)}, \bar Q^{(n)}) = d(Q^{(n)}, \bar Q^{(n)}; \mathcal{F}^{(n)})$.
Therefore, using induction and the definition of the $\beta$-mixing
coefficient (cf. Section II-A), we have

$$d(Q^{(n)}, \bar Q^{(n)}) \le (n-1)\beta_{\theta_0}(l_n) = O(1/n^{1+\eta}),$$

where the last equality follows from Condition 1 and from
our choice of $l_n$. This in turn implies the following useful fact
(see also Lemma 4.1 of Yu [29]), which we shall heavily use
in the proof: for any measurable function $a : (\mathcal{X}^n)^n \to [0, M]$
with $M < \infty$,

$$\left| E_{Q^{(n)}}\{a(Z^n)\} - E_{\bar Q^{(n)}}\{a(Z^n)\} \right| \le M(n-1)\beta_{\theta_0}(l_n) = O(1/n^{1+\eta}), \qquad (10)$$

where the constant hidden in the $O(\cdot)$ notation depends on $M$
and on $\theta_0$.



C. Construction of the database 

The database, or the first-stage codebook, $\mathcal{C}$ is constructed
by random selection. Let $W$ be a probability distribution on $\Lambda$
which is absolutely continuous with respect to the Lebesgue
measure and has an everywhere positive and continuous density
$w(\theta)$. Let $\mathcal{C} = \{\theta(i)\}_{i \in \mathbb{N}}$ be a collection of independent
random vectors taking values in $\Lambda$, each generated according
to $W$ independently of $X$. We use $\mathbb{W}$ to denote the process
distribution of $\mathcal{C}$.

Note that the first-stage codebook is countably infinite,
which means that, in principle, both the encoder and the
decoder must have unbounded memory in order to store it.
This difficulty can be circumvented by using synchronized
random number generators at the encoder and at the decoder,
so that the entries of $\mathcal{C}$ can be generated as needed. Thus,
by construction, the encoder will generate $T_n$ samples (where
$T_n$ is the waiting time) and then communicate (a binary
encoding of) $T_n$ to the decoder. Since the decoder's random
number generator is synchronized with that of the encoder,
the decoder will be able to recover the required entry of $\mathcal{C}$.
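The synchronization argument can be illustrated with a shared seed. The sketch below is only a toy under stated assumptions: the prior $W$ is taken to be a standard Gaussian on $\mathbb{R}^k$, the Euclidean distance stands in for $d_n$, and the search is truncated at a hypothetical `max_draws`; none of these choices come from the paper.

```python
import math
import random

def draw_entry(rng, dim):
    # One codebook entry theta(i); a standard Gaussian prior is an assumption.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def waiting_time(seed, dim, target, radius, max_draws=10**6):
    """Encoder side: index of the first entry within `radius` of `target`.

    Returns (None, None) if no entry is found within max_draws
    (a stand-in for the T_n = +infinity case).
    """
    rng = random.Random(seed)
    for i in range(1, max_draws + 1):
        candidate = draw_entry(rng, dim)
        if math.dist(candidate, target) <= radius:
            return i, candidate
    return None, None

def codebook_entry(seed, index, dim):
    """Decoder side: regenerate theta(index) from the shared seed."""
    rng = random.Random(seed)
    entry = None
    for _ in range(index):
        entry = draw_entry(rng, dim)
    return entry

t, theta_enc = waiting_time(seed=7, dim=2, target=[0.0, 0.0], radius=0.5)
theta_dec = codebook_entry(seed=7, index=t, dim=2)
```

Only the integer `t` needs to be transmitted; the decoder reproduces exactly the entry the encoder selected because both generators start from the same seed.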



D. Parameter estimation 

The parameter estimator $\tilde\theta : \mathcal{X}^{m_n} \to \Lambda$ is constructed
as follows. Because the source $X$ is stationary, it suffices
to describe the action of $\tilde\theta$ on $X^0_{-m_n+1}$. In the notation
of Section IV-A, let $P_{Z^n}$ be the empirical distribution of
$Z^n = (Z_1, \ldots, Z_n)$. For every $\theta \in \Lambda$, define

$$U_\theta(Z^n) = \sup_{A \in \mathcal{A}_n} \left| P^n_\theta(A) - P_{Z^n}(A) \right| = \sup_{A \in \mathcal{A}_n} \left| \int_A p^n_\theta(x^n)\, d\mu(x^n) - P_{Z^n}(A) \right|,$$

where $\mathcal{A}_n$ is the Yatracos class defined by the $n$th-order
densities $\{p^n_\theta : \theta \in \Lambda\}$ (see Section III). Finally, define
$\tilde\theta(X^0_{-m_n+1})$ as any $\theta^* \in \Lambda$ satisfying

$$U_{\theta^*}(Z^n) < \inf_{\theta \in \Lambda} U_\theta(Z^n) + \frac{1}{n},$$

where the extra $1/n$ term is there to ensure that at least one
such $\theta^*$ exists. This is the so-called minimum-distance (MD)
density estimator of Devroye and Lugosi [34], [35] (see also
Devroye and Gyorfi [37]), adapted to the dependent-process
setting of the present paper. The key property of the MD
estimator is that

$$d_n\big(\tilde\theta(X^0_{-m_n+1}), \theta_0\big) \le 4 U_{\theta_0}(Z^n) + \frac{3}{n} \qquad (11)$$

(see, e.g., Theorem 5.1 of Devroye and Gyorfi [37]). This
holds regardless of whether the samples $Z_1, \ldots, Z_n$ are independent
or not.

Fig. 1. The structure of the code $C^{n,m_n}_*$. The shaded blocks are those used for estimating the source parameters.
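To make the minimum-distance rule concrete, here is a toy sketch under strong simplifying assumptions: the alphabet is finite, the candidate set is a finite dictionary of probability mass functions (at least two of them), and each sample plays the role of one block $Z_j$. The function and variable names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def minimum_distance_estimate(candidates, samples):
    """Pick the candidate minimizing U_theta over the Yatracos class.

    candidates: dict name -> dict symbol -> probability (finite alphabet).
    samples:    list of observed symbols.
    Returns (best_name, scores) where scores[name] = U_theta.
    """
    n = len(samples)
    empirical = {x: c / n for x, c in Counter(samples).items()}
    alphabet = set(empirical)
    for pmf in candidates.values():
        alphabet |= set(pmf)

    # Yatracos class: sets {x : p_theta(x) > p_theta'(x)} over ordered pairs.
    yatracos = []
    for a, b in permutations(candidates, 2):
        pa, pb = candidates[a], candidates[b]
        yatracos.append(frozenset(
            x for x in alphabet if pa.get(x, 0.0) > pb.get(x, 0.0)))

    def u(pmf):
        # Largest discrepancy between model and empirical mass on a set.
        return max(abs(sum(pmf.get(x, 0.0) for x in A)
                       - sum(empirical.get(x, 0.0) for x in A))
                   for A in yatracos)

    scores = {name: u(pmf) for name, pmf in candidates.items()}
    return min(scores, key=scores.get), scores

candidates = {"fair": {0: 0.5, 1: 0.5}, "biased": {0: 0.9, 1: 0.1}}
best, scores = minimum_distance_estimate(candidates, [0] * 9 + [1])
```

With a finite candidate set the infimum is attained, so the extra $1/n$ slack in the definition is not needed in this toy version.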

E. Encoding and decoding of parameter estimates 

Next we construct the parameter encoder-decoder pair
$(\tilde g, \psi)$. Given a $\theta \in \Lambda$, define the waiting time

$$T_n(\theta) = \inf\{ i \ge 1 : d_n(\theta, \theta(i)) \le \sqrt{n}\,\delta_n \},$$

with the standard convention that the infimum of the empty set
is equal to $+\infty$. That is, given a $\theta \in \Lambda$, the parameter encoder
looks through the codebook $\mathcal{C}$ and finds the position of the
first $\theta(i)$ such that the variational distance between the $n$th-order
distributions $P^n_\theta$ and $P^n_{\theta(i)}$ is at most $\sqrt{n}\,\delta_n$. If no such
$\theta(i)$ is found, the encoder sets $T_n = +\infty$. We then define the
maps $\tilde g$ and $\psi$ by

$$\tilde g(\theta) = \begin{cases} 0s(T_n), & \text{if } T_n < \infty \\ 1, & \text{if } T_n = \infty \end{cases}$$

and

$$\psi(0s(i)) = \theta(i), \qquad \psi(1) = \theta_d,$$

respectively. Thus, $\tilde{\mathcal S} = \{0s(i)\}_{i \in \mathbb{N}} \cup \{1\}$, and the bound

$$\ell(\tilde g(\theta)) \le \log T_n + O(\log\log T_n) \qquad (12)$$

holds for every $\theta \in \Lambda$, regardless of whether $T_n$ is finite or
infinite.
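A length bound of the form $\log T_n + O(\log\log T_n)$ is what one gets from a universal integer code. The sketch below assumes, purely for illustration, that $s(i)$ is the Elias delta code; the text does not say which integer code is used, and the helper names are hypothetical.

```python
def elias_delta(i):
    """Elias delta codeword of an integer i >= 1, as a string of bits."""
    assert i >= 1
    n_bits = i.bit_length()            # number of binary digits of i
    len_bits = n_bits.bit_length()     # number of binary digits of n_bits
    return "0" * (len_bits - 1) + format(n_bits, "b") + format(i, "b")[1:]

def elias_delta_decode(bits):
    """Decode one codeword from the front of `bits`; return (i, bits_used)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    n_bits = int(bits[zeros:2 * zeros + 1], 2)
    start = 2 * zeros + 1
    rest = bits[start:start + n_bits - 1]
    return int("1" + rest, 2), start + n_bits - 1

def first_stage_encode(t):
    """Flag bit 0 followed by s(T_n) if T_n is finite, else the single bit 1."""
    return "1" if t is None else "0" + elias_delta(t)

def first_stage_decode(bits):
    """Return (T_n or None, bits_used) from the front of the description."""
    if bits[0] == "1":
        return None, 1
    t, used = elias_delta_decode(bits[1:])
    return t, used + 1
```

Because the code is prefix-free, the decoder knows where the first-stage description ends and the second-stage description begins without any extra delimiter.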

F. Performance of the code 

Given the random codebook $\mathcal{C}$, the expected Lagrangian
performance of our code on the source $P_{\theta_0}$ is

$$L_{\theta_0}(C^{n,m_n}_*, \lambda) = E_{\theta_0}\{g(X_1^n, X^0_{-m_n+1})\} = E_{\theta_0}\{h(X_1^n, X^0_{-m_n+1})\} + \lambda E_{\theta_0}\{\ell_n(\tilde f(X^0_{-m_n+1}))\}. \qquad (13)$$

We now upper-bound the two terms in (13). We start with the
second term.

We need to bound the expectation of the waiting time
$T_n = T_n(\tilde\theta(X^0_{-m_n+1}))$. Our strategy borrows some elements
from the paper of Kontoyiannis and Zhang [38]. Consider the
probability

$$q_n = W\big( d_n(\Theta, \tilde\theta(X^0_{-m_n+1})) \le \sqrt{n}\,\delta_n \big),$$

which is a random function of $X^0_{-m_n+1}$. From Condition 2,
it follows for $n$ sufficiently large that

$$q_n \ge W\big( \|\Theta - \tilde\theta(X^0_{-m_n+1})\| \le \delta_n / c_{\tilde\theta} \big),$$

where $\tilde\theta = \tilde\theta(X^0_{-m_n+1})$. Because the density $w(\theta)$ is
everywhere positive, the latter probability is strictly positive
for almost all $X^0_{-m_n+1}$, and so $q_n > 0$ eventually almost
surely. Thus, the waiting times $T_n$ will be finite eventually
almost surely (with respect to both the source $X$ and the
first-stage codebook $\mathcal{C}$). Now, if $q_n > 0$, then, conditioned
on $X^0_{-m_n+1} = x^0_{-m_n+1}$, the waiting time $T_n$ is a geometric
random variable with parameter $q_n$, and it is not hard to show
(see, e.g., Lemma 3 of Kontoyiannis and Zhang [38]) that for
any $\epsilon > 0$

$$P\big( \log[(T_n - 1) q_n] \ge \epsilon \,\big|\, X^0_{-m_n+1} = x^0_{-m_n+1} \big) \le e^{-2^{\epsilon}}.$$

Setting $\epsilon = \log(2\log n)$, we have, for almost all $X^0_{-m_n+1}$,
that

$$P\big( \log[(T_n - 1) q_n] \ge \log(2\log n) \,\big|\, X^0_{-m_n+1} \big) \le e^{-2\log n} \le n^{-2}.$$

Then, by the Borel-Cantelli lemma,

$$\log(T_n q_n) \le \log\log n + 2$$

eventually almost surely, so that

$$E_{\theta_0}\{\log T_n\} \le \log\log n + 2 - E_{\theta_0}\{\log q_n\} \qquad (14)$$

for almost every realization of the random codebook $\mathcal{C}$ and
for sufficiently large $n$. We now obtain an asymptotic lower
bound on $E_{\theta_0}\{\log q_n\}$. Define the events

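$$F_n = \{ d_n(\tilde\theta(X^0_{-m_n+1}), \theta_0) \le \sqrt{n}\,\delta_n/2 \},$$
$$G_n = \{ d_n(\Theta, \theta_0) \le \sqrt{n}\,\delta_n/2 \},$$
$$H_n = \{ \|\Theta - \theta_0\| \le \delta_n/(2c_{\theta_0}) \}.$$

Then by the triangle inequality we have

$$F_n \text{ and } G_n \ \Longrightarrow\ d_n(\Theta, \tilde\theta(X^0_{-m_n+1})) \le \sqrt{n}\,\delta_n,$$

and, for $n$ sufficiently large, we can write

$$q_n \overset{(a)}{\ge} W(G_n)\, P_{\theta_0}(F_n) \overset{(b)}{=} W(G_n)\, Q^{(n)}(F_n) \overset{(c)}{\ge} W(H_n)\, Q^{(n)}(F_n),$$

where (a) follows from the independence of $X$ and $\mathcal{C}$, (b)
follows from the fact that the parameter estimator $\tilde\theta(X^0_{-m_n+1})$
depends only on $Z^n$, and (c) follows from Condition 2 and the
fact that $\delta_n \to 0$. Since the density $w$ is everywhere positive
and continuous at $\theta_0$, $w(\theta) \ge w(\theta_0)/2$ for all $\theta \in H_n$ for $n$
sufficiently large, so

$$W(H_n) = \int_{H_n} w(\theta)\, d\theta \ge \frac{1}{2} w(\theta_0)\, v_k \left( \frac{\delta_n}{2c_{\theta_0}} \right)^k, \qquad (15)$$

where $v_k$ is the volume of the unit sphere in $\mathbb{R}^k$. Next, the
fact that the minimum-distance estimate $\tilde\theta(X^0_{-m_n+1})$ depends
only on $Z^n$ implies that the event $F_n$ belongs to the $\sigma$-algebra
$\mathcal{F}^{(n)}$, and from (10) we get

$$Q^{(n)}(F_n) \ge \bar Q^{(n)}(F_n) - O(1/n^{1+\eta}). \qquad (16)$$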



Under $\bar Q^{(n)}$, the $n$-blocks $Z_1, \ldots, Z_n$ are i.i.d. according to
$P^n_{\theta_0}$, and we can invoke the Vapnik-Chervonenkis machinery
to lower-bound $\bar Q^{(n)}(F_n)$. In the notation of Sec. IV-D, define
the event

$$I_n = \left\{ 4U_{\theta_0}(Z^n) + \frac{3}{n} \le \frac{\sqrt{n}\,\delta_n}{2} \right\}.$$

Then $I_n$ implies $F_n$ by (11), and

$$\bar Q^{(n)}(F_n^c) \le \bar Q^{(n)}(I_n^c) \le 8 n^{V(\mathcal{A}_n)} e^{-n(\sqrt{n}\,\delta_n - 6/n)^2/2048}, \qquad (17)$$

where the second bound is by the Vapnik-Chervonenkis
inequality of Lemma 2.1. Combining the bounds (16) and
(17) and using Condition 1, we obtain

$$Q^{(n)}(F_n) \ge 1 - 8 n^{V(\mathcal{A}_n)} e^{-n(\sqrt{n}\,\delta_n - 6/n)^2/2048} - O(1/n^{1+\eta}). \qquad (18)$$

Now, if we choose

$$\delta_n = \frac{\sqrt{2048\,(V(\mathcal{A}_n) + 1) \ln n}}{n} + \frac{6}{n^{3/2}},$$

then the right-hand side of (18) can be further lower-bounded
by $1 - O(1/n)$. Combining this with (15), taking logarithms,
and then taking expectations, we obtain

$$E_{\theta_0}\{\log q_n\} \ge \log(1 - O(1/n)) + k\log\delta_n + \log\frac{w(\theta_0)\, v_k}{2(2c_{\theta_0})^k}$$
$$= \log(1 - O(1/n)) + k\log\left[ \sqrt{2048\,(V(\mathcal{A}_n) + 1)\, n \ln n} + 6 \right] - \frac{3k}{2}\log n + c(k, \theta_0)$$
$$\ge \log(1 - O(1/n)) + \frac{3k}{2}\log\frac{1}{n} + c(k, \theta_0),$$

where $c(k, \theta_0)$ is a constant that depends only on $k$ and $\theta_0$.
Using this and (14), we get that

$$E_{\theta_0}\{\log T_n\} \le \log\log n + O(\log n)$$

for $\mathbb{W}$-almost every realization of the random codebook $\mathcal{C}$,
for $n$ sufficiently large. Together with (12), this implies that

$$E_{\theta_0}\{\ell_n(\tilde f(X^0_{-m_n+1}))\} = O\left( \frac{\log n}{n} \right) + O\left( \frac{\log\log n}{n} \right)$$

for $\mathbb{W}$-almost all realizations of the first-stage codebook.



We now turn to the first term in (13). Recall that, for each
$\theta \in \Lambda$, the code $C^n_\theta$ is $n$th-order optimal for $P_\theta$. Using this
fact together with the boundedness of the distortion measure
$\rho$, we can invoke Lemma A.3 in the Appendix and assume
without loss of generality that each $C^n_\theta$ has a finite codebook
(of size not exceeding $2^{2n\rho_{\max}/\lambda}$), and each codevector can
be described by a binary string of no more than $2n\rho_{\max}/\lambda$
bits. Hence, $h(X_1^n, X^0_{-m_n+1}) \le 3\rho_{\max}$. Let $P^-$ and $P^+$ be
the marginal distributions of $P_{\theta_0}$ on $\sigma(X^0_{-\infty})$ and $\sigma(X_1^\infty)$,
respectively. Note that $h(X_1^n, X^0_{-m_n+1})$ does not depend on
$X^0_{-l_n+1}$. This, together with Condition 1 and the choice of $l_n$,
implies that

$$E_{\theta_0}\{h(X_1^n, X^0_{-m_n+1})\} \le E_{P^- \times P^+}\{h(X_1^n, X^0_{-m_n+1})\} + \beta_{\theta_0}(l_n) = E_{P^- \times P^+}\{h(X_1^n, X^0_{-m_n+1})\} + O(1/n^{2+\eta}).$$

Furthermore,

$$E_{P^- \times P^+}\{h(X_1^n, X^0_{-m_n+1})\} = \int\!\!\int h(x^n, z^{m_n})\, dP_{\theta_0}(x^n)\, dP_{\theta_0}(z^{m_n})$$
$$\overset{(a)}{=} \int E_{\theta_0}\{h(X_1^n, z^{m_n})\}\, dP_{\theta_0}(z^{m_n})$$
$$\overset{(b)}{=} E_{\theta_0}\left\{ L_{\theta_0}\big(C^n_{\hat\theta(X^0_{-m_n+1})}, \lambda\big) \right\},$$
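where (a) follows by Fubini's theorem and the boundedness of
$h$, while (b) follows from the definition of $h$. The Lagrangian
performance of the code $C^n_{\hat\theta}$, where $\hat\theta = \hat\theta(X^0_{-m_n+1})$, can be
further bounded as

$$L_{\theta_0}(C^n_{\hat\theta}, \lambda) \overset{(a)}{\le} L_{\hat\theta}(C^n_{\hat\theta}, \lambda) + 3\rho_{\max}\, d_n(\hat\theta(X^0_{-m_n+1}), \theta_0)$$
$$\overset{(b)}{=} \hat L^n_{\hat\theta}(\lambda) + 3\rho_{\max}\, d_n(\hat\theta(X^0_{-m_n+1}), \theta_0)$$
$$\overset{(c)}{\le} \hat L^n_{\theta_0}(\lambda) + 4\rho_{\max}\, d_n(\hat\theta(X^0_{-m_n+1}), \theta_0)$$
$$\overset{(d)}{\le} \hat L^n_{\theta_0}(\lambda) + 4\rho_{\max}\left[ d_n(\hat\theta(X^0_{-m_n+1}), \tilde\theta(X^0_{-m_n+1})) + d_n(\tilde\theta(X^0_{-m_n+1}), \theta_0) \right],$$

where (a) follows from Lemma A.3 in the Appendix, (b)
follows from the $n$th-order optimality of $C^n_{\hat\theta}$ for $P_{\hat\theta}$, (c)
follows, overbounding slightly, from the Lagrangian mismatch
bound of Lemma A.2 in the Appendix, and (d) follows from
the triangle inequality. Taking expectations, we obtain

$$E_{\theta_0}\left\{ L_{\theta_0}\big(C^n_{\hat\theta(X^0_{-m_n+1})}, \lambda\big) \right\} \le \hat L^n_{\theta_0}(\lambda) + 4\rho_{\max} \cdot E_{\theta_0}\left\{ d_n(\hat\theta(X^0_{-m_n+1}), \tilde\theta(X^0_{-m_n+1})) + d_n(\tilde\theta(X^0_{-m_n+1}), \theta_0) \right\}. \qquad (19)$$

The second $d_n(\cdot,\cdot)$ term in (19) can be interpreted as the
estimation error due to estimating $P^n_{\theta_0}$ by $P^n_{\tilde\theta}$, while the first
$d_n(\cdot,\cdot)$ term is the approximation error due to quantization of the
parameter estimate $\tilde\theta$. We examine the estimation error first.
Using (11), we can write

$$E_{\theta_0}\{d_n(\tilde\theta(X^0_{-m_n+1}), \theta_0)\} \le 4 E_{Q^{(n)}}\{U_{\theta_0}(Z^n)\} + \frac{3}{n}. \qquad (20)$$

Now, each $Z_j$ is distributed according to $P^n_{\theta_0}$, and we can
approximate the expectation of $U_{\theta_0}(Z^n)$ with respect to $Q^{(n)}$
by the expectation of $U_{\theta_0}(Z^n)$ with respect to the product
measure $\bar Q^{(n)}$:

$$E_{Q^{(n)}}\{U_{\theta_0}(Z^n)\} \le E_{\bar Q^{(n)}}\{U_{\theta_0}(Z^n)\} + (n-1)\beta_{\theta_0}(l_n) \le c\sqrt{\frac{V(\mathcal{A}_n)\log n}{n}} + O\left( \frac{1}{n^{1+\eta}} \right),$$

where the second estimate follows from the Vapnik-Chervonenkis
inequality and from the choice of $l_n$. This,
together with (20), yields

$$E_{\theta_0}\{d_n(\tilde\theta(X^0_{-m_n+1}), \theta_0)\} = O\left( \sqrt{\frac{V(\mathcal{A}_n)\log n}{n}} \right). \qquad (21)$$

As for the first $d_n(\cdot,\cdot)$ term in (19), we have, by construction
of the first-stage encoder, that

$$d_n(\hat\theta(X^0_{-m_n+1}), \tilde\theta(X^0_{-m_n+1})) \le \sqrt{n}\,\delta_n = O\left( \sqrt{\frac{V(\mathcal{A}_n)\log n}{n}} \right) \qquad (22)$$

eventually almost surely, so the corresponding expectation is
$O(\sqrt{V(\mathcal{A}_n)\log n/n})$ as well. Summing the estimates (21) and
(22), we obtain

$$E_{\theta_0}\left\{ L_{\theta_0}\big(C^n_{\hat\theta(X^0_{-m_n+1})}, \lambda\big) \right\} \le \hat L^n_{\theta_0}(\lambda) + O\left( \sqrt{\frac{V(\mathcal{A}_n)\log n}{n}} \right).$$

Finally, putting everything together, we see that, eventually,

$$L_{\theta_0}(C^{n,m_n}_*, \lambda) \le \hat L^n_{\theta_0}(\lambda) + O\left( \sqrt{\frac{V(\mathcal{A}_n)\log n}{n}} \right) + \lambda\left[ O\left( \frac{\log n}{n} \right) + O\left( \frac{\log\log n}{n} \right) \right] + O\left( \frac{1}{n^{2+\eta}} \right) \qquad (23)$$

for $\mathbb{W}$-almost every realization of the first-stage codebook $\mathcal{C}$.
This proves the stated bound on the Lagrangian redundancy.

G. Identification of the active source

We have seen that the expected variational distance
$E_{\theta_0}\{d_n(\theta_0, \hat\theta(X^0_{-m_n+1}))\}$ between the $n$-dimensional
marginals of the true source $P_{\theta_0}$ and the estimated source
$P_{\hat\theta(X^0_{-m_n+1})}$ converges to zero as $\sqrt{V(\mathcal{A}_n)\log n/n}$. We wish
to show that this convergence also holds eventually with
probability one, i.e.,

$$d_n(\theta_0, \hat\theta(X^0_{-m_n+1})) = O\left( \sqrt{\frac{V(\mathcal{A}_n)\log n}{n}} \right) \qquad (24)$$

$P_{\theta_0}$-almost surely.

Given an $\epsilon > 0$, we have by the triangle inequality that
$d_n(\theta_0, \hat\theta(X^0_{-m_n+1})) > \epsilon$ implies

$$d_n(\theta_0, \tilde\theta(X^0_{-m_n+1})) + d_n(\tilde\theta(X^0_{-m_n+1}), \hat\theta(X^0_{-m_n+1})) > \epsilon,$$

where $\tilde\theta(X^0_{-m_n+1})$ is the minimum-distance estimate of $P^n_{\theta_0}$
from $X^0_{-m_n+1}$ (cf. Section IV-D). Recalling our construction
of the first-stage encoder, we see that this further implies

$$d_n(\theta_0, \tilde\theta(X^0_{-m_n+1})) > \epsilon - \sqrt{n}\,\delta_n.$$

Finally, using the property (11) of the minimum-distance
estimator, we obtain that $d_n(\theta_0, \hat\theta(X^0_{-m_n+1})) > \epsilon$ implies

$$U_{\theta_0}(Z^n) > \frac{1}{4}\left( \epsilon - \sqrt{n}\,\delta_n - \frac{3}{n} \right).$$

Therefore,

$$P_{\theta_0}\big( d_n(\theta_0, \hat\theta(X^0_{-m_n+1})) > \epsilon \big) \le Q^{(n)}\left( U_{\theta_0}(Z^n) > \frac{1}{4}\left( \epsilon - \sqrt{n}\,\delta_n - \frac{3}{n} \right) \right)$$
$$\overset{(a)}{\le} \bar Q^{(n)}\left( U_{\theta_0}(Z^n) > \frac{1}{4}\left( \epsilon - \sqrt{n}\,\delta_n - \frac{3}{n} \right) \right) + (n-1)\beta_{\theta_0}(l_n)$$
$$\overset{(b)}{\le} 8 n^{V(\mathcal{A}_n)} \exp\left( -\frac{n(\epsilon - \sqrt{n}\,\delta_n - 3/n)^2}{512} \right) + (n-1)\beta_{\theta_0}(l_n), \qquad (25)$$

where (a) follows, as before, from the definition of
the $\beta$-mixing coefficient and (b) follows by the Vapnik-Chervonenkis
inequality. Now, if we choose

$$\epsilon_n = \sqrt{\frac{512\,(V(\mathcal{A}_n)\ln n + n\delta)}{n}} + \sqrt{n}\,\delta_n + \frac{3}{n}$$

for an arbitrary small $\delta > 0$, then (25) can be further upper-bounded
by $8e^{-n\delta} + (n-1)\beta_{\theta_0}(l_n)$, which, owing to Condition
1 and the choice $l_n = \lceil n^{(2+\eta)/r} \rceil$, is summable in $n$. Thus,

$$\sum_n P_{\theta_0}\big( d_n(\theta_0, \hat\theta(X^0_{-m_n+1})) > \epsilon_n \big) < \infty,$$

and we obtain (24) by the Borel-Cantelli lemma.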

V. Examples 
A. Stationary memoryless sources 

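As a basic check, let us see how Theorem 3.1 applies to
stationary memoryless (i.i.d.) sources. Let $\mathcal{X} = \mathbb{R}$, and let
$\{P_\theta : \theta \in \Lambda\}$ be the collection of all Gaussian i.i.d. processes,
where

$$\Lambda = \{ (m, \sigma) : m \in \mathbb{R},\ 0 < \sigma < \infty \} \subset \mathbb{R}^2.$$

Then the $n$-dimensional marginal for a given $\theta = (m, \sigma)$ has
the Gaussian density

$$p^n_\theta(x^n) = \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i=1}^n e^{-(x_i - m)^2/2\sigma^2}$$

with respect to the Lebesgue measure. This class of sources
trivially satisfies Condition 1 with $r = +\infty$, and it remains to
check Conditions 2 and 3.

To check Condition 2, let us examine the normalized $n$th-order
relative entropy between $P_\theta$ and $P_{\theta'}$, with $\theta = (m, \sigma)$
and $\theta' = (m', \sigma')$. Because the sources are i.i.d.,

$$D_n(P_\theta \| P_{\theta'}) = D(P^1_\theta \| P^1_{\theta'}) = \frac{1}{2}\left[ \ln\frac{\sigma'^2}{\sigma^2} + \frac{\sigma^2}{\sigma'^2} - 1 + \frac{(m - m')^2}{\sigma'^2} \right].$$

Applying the inequality $\ln x \le x - 1$ and some straightforward
algebra, we get the bound

$$D_n(P_\theta \| P_{\theta'}) \le \frac{(\sigma + \sigma')^2 (\sigma - \sigma')^2}{2\sigma^2\sigma'^2} + \frac{(m - m')^2}{2\sigma'^2}.$$

Now fix a small $\delta \in (0, \sigma)$, and suppose that $\|\theta - \theta'\| < \delta$.
Then $|\sigma - \sigma'| < \delta$, so we can further upper-bound $D_n(P_\theta \| P_{\theta'})$
by

$$D_n(P_\theta \| P_{\theta'}) \le \frac{9}{2(\sigma - \delta)^2} \|\theta - \theta'\|^2.$$

Thus, for a given $\theta \in \Lambda$, we see that

$$D_n(P_\theta \| P_{\theta'}) \le \frac{c_\theta^2}{2} \|\theta - \theta'\|^2$$

for all $\theta'$ in the open ball of radius $\delta$ around $\theta$, with $c_\theta =
3/(\sigma - \delta)$. Using Pinsker's inequality, we have

$$\frac{1}{\sqrt{n}}\, d_n(\theta, \theta') \le \sqrt{2 D_n(P_\theta \| P_{\theta'})} \le c_\theta \|\theta - \theta'\|$$

for all $n$. Thus, Condition 2 holds.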
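The quadratic bound on the relative entropy can be sanity-checked numerically. The sketch below assumes the standard closed form for the relative entropy between two univariate Gaussians and the constant $c_\theta = 3/(\sigma - \delta)$; the sampling scheme and the function names are illustrative only.

```python
import math
import random

def gaussian_kl(m, s, m2, s2):
    """Relative entropy D(N(m, s^2) || N(m2, s2^2)) in nats."""
    return math.log(s2 / s) + (s * s + (m - m2) ** 2) / (2 * s2 * s2) - 0.5

def condition2_gap(m, s, delta, trials=1000, seed=0):
    """Largest observed value of D - (c^2 / 2) * ||theta - theta'||^2
    over random theta' in the open ball of radius delta around (m, s).
    A non-positive result is consistent with the claimed bound."""
    rng = random.Random(seed)
    c = 3.0 / (s - delta)
    worst = float("-inf")
    for _ in range(trials):
        radius = delta * math.sqrt(rng.random())   # strictly less than delta
        angle = 2.0 * math.pi * rng.random()
        m2 = m + radius * math.cos(angle)
        s2 = s + radius * math.sin(angle)
        dist_sq = (m - m2) ** 2 + (s - s2) ** 2
        worst = max(worst, gaussian_kl(m, s, m2, s2) - 0.5 * c * c * dist_sq)
    return worst

gap = condition2_gap(m=0.0, s=1.0, delta=0.5)
```

A random search of this kind cannot prove the inequality, but a positive gap would immediately flag an algebra slip in the constant.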

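To check Condition 3, note that, for each $n$, the Yatracos
class $\mathcal{A}_n$ consists of all sets of the form

$$\left\{ x^n \in \mathbb{R}^n : \ln\sigma^2 - \ln\sigma'^2 + \frac{1}{n\sigma^2}\sum_{i=1}^n (x_i - m)^2 - \frac{1}{n\sigma'^2}\sum_{i=1}^n (x_i - m')^2 > 0 \right\} \qquad (26)$$

for all $m, m' \in \mathbb{R}$; $\sigma, \sigma' \in (0, \infty)$. Let $\alpha = \ln\sigma^2$ and $\alpha' =
\ln\sigma'^2$. Then we can rewrite (26) as

$$\left\{ x^n \in \mathbb{R}^n : \alpha - \alpha' + \frac{1}{n\sigma^2}\sum_{i=1}^n (x_i - m)^2 - \frac{1}{n\sigma'^2}\sum_{i=1}^n (x_i - m')^2 > 0 \right\}.$$

This is the set of all $x^n \in \mathbb{R}^n$ such that

$$\Pi(x^n, \alpha, \alpha', 1/\sigma^2, 1/\sigma'^2, m, m') > 0,$$

where $\Pi(x^n, \cdot)$ is a third-degree polynomial in the six parameters
$(\alpha, \alpha', 1/\sigma^2, 1/\sigma'^2, m, m')$. It then follows from
Lemma 2.2 that $\mathcal{A}_n$ is a VC class with $V(\mathcal{A}_n) \le 12\log(12e)$.
Therefore, Condition 3 holds as well.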
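The polynomial description of the Yatracos sets is just a rescaled comparison of log-densities, which can be checked numerically. In the sketch below the sign convention (positive polynomial means the primed density is larger) is a choice made for this example; since the class ranges over all ordered pairs of parameters, either orientation yields the same class. The function names are hypothetical.

```python
import math

def log_density(x, m, sigma):
    """Log of the n-dimensional i.i.d. Gaussian density at the point x."""
    n = len(x)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - sum((xi - m) ** 2 for xi in x) / (2.0 * sigma ** 2))

def yatracos_poly(x, m, sigma, m2, sigma2):
    """Polynomial in (alpha, alpha', 1/sigma^2, 1/sigma'^2, m, m') whose sign
    decides which of the two densities is larger at x."""
    n = len(x)
    alpha, alpha2 = math.log(sigma ** 2), math.log(sigma2 ** 2)
    return (alpha - alpha2
            + sum((xi - m) ** 2 for xi in x) / (n * sigma ** 2)
            - sum((xi - m2) ** 2 for xi in x) / (n * sigma2 ** 2))

x = [0.3, -1.2, 2.5, 0.7]
# (n / 2) * polynomial equals log p_{theta'}(x) - log p_theta(x).
lhs = 0.5 * len(x) * yatracos_poly(x, 0.0, 1.0, 0.5, 2.0)
rhs = log_density(x, 0.5, 2.0) - log_density(x, 0.0, 1.0)
```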

B. Autoregressive sources 

Again, let $\mathcal{X} = \mathbb{R}$ and consider the case when $X$ is a
Gaussian autoregressive source of order $p$, i.e., it is the output
of an autoregressive filter of order $p$ driven by white Gaussian
noise. Then there exist $p$ real parameters $a_1, \ldots, a_p$ (the filter
coefficients), such that

$$X_n = -\sum_{i=1}^p a_i X_{n-i} + Y_n, \qquad \forall n \in \mathbb{Z}$$

where $Y = \{Y_i\}_{i \in \mathbb{Z}}$ is an i.i.d. Gaussian process with zero
mean and unit variance. Let $\Lambda \subset \mathbb{R}^p$ be the set of all
$(a_1, \ldots, a_p)$, such that the roots of the polynomial $A(z) =
\sum_{i=0}^p a_i z^i$, where $a_0 = 1$, lie outside the unit circle in the
complex plane. This ensures that $X$ is a stationary process.
We now proceed to check that Conditions 1-3 of Section III
are satisfied.
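Membership in the parameter set can be tested without computing roots. The sketch below uses the classical step-down (reflection-coefficient) recursion, which is one standard way to check the root condition and is not taken from the paper; it assumes the sign convention $A(z) = 1 + a_1 z + \cdots + a_p z^p$, and the function name is hypothetical.

```python
def is_stationary_ar(coeffs):
    """Return True if all roots of A(z) = 1 + a_1 z + ... + a_p z^p lie
    strictly outside the unit circle.

    Uses the step-down recursion: the condition holds if and only if every
    reflection coefficient produced along the way has magnitude below 1.
    coeffs is the list [a_1, ..., a_p].
    """
    a = list(coeffs)
    while a:
        k = a[-1]                      # reflection coefficient of this order
        if abs(k) >= 1.0:
            return False
        denom = 1.0 - k * k
        p = len(a)
        # Reduce the order by one: a_i <- (a_i - k * a_{p-i}) / (1 - k^2).
        a = [(a[i] - k * a[p - 2 - i]) / denom for i in range(p - 1)]
    return True
```

For example, `is_stationary_ar([0.5])` corresponds to $A(z) = 1 + 0.5z$ with its single root at $z = -2$, which is outside the unit circle.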

The distribution of each $Y_i$ is absolutely continuous, and we
can invoke the result of Mokkadem [31] to conclude that, for
each $\theta \in \Lambda$, the process $X$ is geometrically mixing, i.e., for
every $\theta \in \Lambda$, there exists some $\gamma = \gamma(\theta) \in (0, 1)$, such that
$\beta_\theta(k) = O(\gamma^k)$. Now, for any fixed $r > 0$, $\gamma^k \le k^{-r}$ for $k$
sufficiently large, so Condition 1 holds.

To check Condition 2, note that, for each $\theta \in \Lambda$, the
Fisher information matrix $I_n(\theta)$ is independent of $n$ (see,
e.g., Section 6 of Klein and Spreij [39]). Thus, the asymptotic
Fisher information matrix $I(\theta) = \lim_{n \to \infty} I_n(\theta)$ exists and is
nonsingular [39, Theorem 6.1], so, recalling the discussion in
Section III, we conclude that Condition 2 holds also.

To verify Condition 3, consider the $n$-dimensional marginal
$P^n_\theta$, which has the Gaussian density

$$p^n_\theta(x^n) = \frac{1}{(2\pi)^{n/2} (\det R_n(\theta))^{1/2}}\, e^{-\frac{1}{2} x^{nT} R_n^{-1}(\theta) x^n},$$

where $R_n(\theta) = E_\theta\{X_1^n (X_1^n)^T\}$ is the $n$th-order autocorrelation
matrix of $X$. Thus, the Yatracos class $\mathcal{A}_n$ consists of
sets of the form

$$A_{\theta,\theta'} = \left\{ x^n \in \mathbb{R}^n : \frac{1}{2}\ln\det R_n^{-1}(\theta) - \frac{1}{2} x^{nT} R_n^{-1}(\theta) x^n > \frac{1}{2}\ln\det R_n^{-1}(\theta') - \frac{1}{2} x^{nT} R_n^{-1}(\theta') x^n \right\}$$

for all $\theta, \theta' \in \Lambda$. Now, for every $\theta \in \Lambda$, let $\bar\theta =
(\theta, \ln\det R_n^{-1}(\theta))$. Since $\ln\det R_n^{-1}(\theta)$ is uniquely determined
by $\theta$, we have $A_{\theta,\theta'} = A_{\bar\theta,\bar\theta'}$ for all $\theta, \theta' \in \Lambda$. Using
this fact, as well as the easily established fact that the entries
of the inverse covariance matrix $R_n^{-1}(\theta)$ are second-degree
polynomials in the filter coefficients $a_1, \ldots, a_p$, we see that,
for each $x^n$, the condition $x^n \in A_{\theta,\theta'}$ can be expressed as
$\Pi(x^n, \bar\theta, \bar\theta') > 0$, where $\Pi(x^n, \cdot)$ is quadratic in the $2p + 2$
real variables $\bar\theta_1, \ldots, \bar\theta_{p+1}, \bar\theta'_1, \ldots, \bar\theta'_{p+1}$. Thus, we can apply
Lemma 2.2 to conclude that $V(\mathcal{A}_n) \le (4p + 4)\log(8e)$.
Therefore, Condition 3 is satisfied as well.

C. Hidden Markov processes 

A hidden Markov process (or a hidden Markov model,
see, e.g., [40]) is a discrete-time bivariate random process
$\{(S_i, X_i)\}$, where $S = \{S_i\}$ is a homogeneous Markov chain
and $X = \{X_i\}$ is a sequence of random variables which
are conditionally independent given $S$, and the conditional
distribution of $X_n$ is time-invariant and depends on $S$ only
through $S_n$. The Markov chain $S$, also called the regime, is
not available for observation. The observable component $X$ is
the source of interest. In information theory (see, e.g., [41] and
references therein), a hidden Markov process is a discrete-time
finite-state homogeneous Markov chain $S$, observed through
a discrete-time memoryless channel, so that $X = \{X_i\}$ is the
observation sequence at the output of the channel.

Let $M$ denote the number of states of $S$. We assume
without loss of generality that the state space $\mathcal{S}$ of $S$ is
the set $\{1, 2, \ldots, M\}$. Let $A = [a_{ij}]_{i,j=1,\ldots,M}$ denote the
$M \times M$ transition matrix of $S$, where $a_{ij} = P(S_{t+1} =
j \,|\, S_t = i)$. If $A$ is ergodic (i.e., irreducible and aperiodic),
then there exists a unique probability distribution $\pi$ on $\mathcal{S}$
such that $\pi = \pi A$ (the stationary distribution of $S$), see,
e.g., Section 8 of Billingsley [42]. Because in this paper we
deal with two-sided random processes, we assume that $S$ has
been initialized with its stationary distribution at some time
sufficiently far away in the past, and can therefore be thought
of as a two-sided stationary process. Now consider a discrete-time
memoryless channel with input alphabet $\mathcal{S}$ and output
(observation) alphabet $\mathcal{X} = \mathbb{R}^d$ for some $d < \infty$. It is specified
by a set $\{p(\cdot|s) : s = 1, 2, \ldots, M\}$ of transition densities (with
respect to $\mu$, the Lebesgue measure on $\mathbb{R}^d$). The channel output
sequence $X$ is the source of interest.



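Let us take as the parameter space $\Lambda \subset \mathbb{R}^{M \times M}$ the set of
all $M \times M$ transition matrices $[a_{ij}]$, such that all $a_{ij} > a_0$ for
some fixed $a_0 > 0$. For each $\theta = [a_{ij}] \in \Lambda$ and each $n \in \mathbb{N}$,
the density $dP^n_\theta/d\mu^n$ is given by

$$p^n_\theta(x^n) = \sum_{s^n \in \mathcal{S}^n} \prod_{i=1}^n a_{s_{i-1} s_i}\, p(x_i | s_i),$$

where $a_{s_0 s} = \pi_s$ for every $s \in \mathcal{S}$. We assume that the channel
transition densities $p(\cdot|s)$, $s \in \mathcal{S}$, are fixed a priori, and do not
include them in the parametric description of the sources. We
do require, though, that

$$\sum_{s \in \mathcal{S}} p(x|s) > 0, \qquad \forall x \in \mathcal{X}$$

and

$$E_\theta\left\{ \left| \log\sum_{s \in \mathcal{S}} p(X_1|s) \right| \right\} < \infty, \qquad \forall \theta \in \Lambda.$$

We now proceed to verify that Conditions 1-3 of Section III
are met.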
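The sum over all state sequences in the density has $M^n$ terms, but it can be evaluated with the usual forward recursion. The sketch below compares the two on a small example; the two-state chain, the unit-variance Gaussian emission densities with state-dependent means, and the power-iteration approximation of the stationary distribution are all assumptions of this example.

```python
import itertools
import math

def stationary_distribution(A, iters=1000):
    """Approximate pi = pi A by power iteration (assumes A is ergodic)."""
    M = len(A)
    pi = [1.0 / M] * M
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(M)) for j in range(M)]
    return pi

def hmm_density_forward(A, emission, xs):
    """Density of xs via the forward recursion, with S_1 drawn from pi."""
    M = len(A)
    pi = stationary_distribution(A)
    alpha = [pi[s] * emission(xs[0], s) for s in range(M)]
    for x in xs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(M)) * emission(x, j)
                 for j in range(M)]
    return sum(alpha)

def hmm_density_bruteforce(A, emission, xs):
    """Same density by summing over all M^n state sequences."""
    M = len(A)
    pi = stationary_distribution(A)
    total = 0.0
    for states in itertools.product(range(M), repeat=len(xs)):
        weight = pi[states[0]] * emission(xs[0], states[0])
        for t in range(1, len(xs)):
            weight *= A[states[t - 1]][states[t]] * emission(xs[t], states[t])
        total += weight
    return total

def gaussian_emission(x, s):
    # Hypothetical channel: unit-variance Gaussian with mean equal to s.
    return math.exp(-0.5 * (x - s) ** 2) / math.sqrt(2.0 * math.pi)

A = [[0.9, 0.1], [0.2, 0.8]]
xs = [0.1, 0.9, 1.2, -0.3]
forward = hmm_density_forward(A, gaussian_emission, xs)
brute = hmm_density_bruteforce(A, gaussian_emission, xs)
```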



Let $p^{(n)}_{ij} = P(S_{t+n} = j \,|\, S_t = i)$ denote the $n$-step transition
probability for states $i, j \in \mathcal{S}$. The positivity of $A$ implies that
the Markov chain $S$ is geometrically ergodic, i.e.,

$$\left| p^{(n)}_{ij} - \pi_j \right| \le C\gamma^n, \qquad \forall i, j \in \mathcal{S};\ \forall n \in \mathbb{N} \qquad (27)$$

where $C > 0$ and $0 < \gamma < 1$, see Theorem 8.9 of Billingsley
[42]. Note that (27) implies that

$$d\big(p^{(n)}(\cdot|i), \pi\big) \le MC\gamma^n.$$

This in turn implies that the sequence $S = \{S_i\}$ is exponentially
$\beta$-mixing, see Theorem 3.10 of Vidyasagar [24]. Now,
one can show (see Section 3.5.3 of Vidyasagar [24]) that there
exists a measurable mapping $F : \mathcal{S} \times [0, 1] \to \mathcal{X}$, such that
$X_i = F(S_i, U_i)$, where $U = \{U_i\}$ is an i.i.d. sequence of
random variables distributed uniformly on $[0, 1]$, independently
of $S$. It is not hard to show that, if $S$ is exponentially
$\beta$-mixing, then so is the bivariate process $\{(S_i, U_i)\}$. Finally,
because $X_i$ is given by a time-invariant deterministic function
of $(S_i, U_i)$, the $\beta$-mixing coefficients of $X$ are bounded by the
corresponding $\beta$-mixing coefficients of $(S, U)$, and so $X$ is
exponentially $\beta$-mixing as well. Thus, for each $\theta \in \Lambda$, there
exists a $\gamma = \gamma(\theta) \in [0, 1)$, such that $\beta_\theta(k) = O(\gamma^k)$, and
consequently Condition 1 holds.



To show that Condition 2 holds, we again examine the
asymptotic behavior of the Fisher information matrix $I_n(\theta)$
as $n \to \infty$. Under our assumptions on the state transition
matrices in $\Lambda$ and on the channel transition densities
$\{p(\cdot|s) : s \in \mathcal{S}\}$, we can invoke the results of Section 6.2 in
Douc, Moulines and Ryden [43] to conclude that the asymptotic
Fisher information matrix $I(\theta) = \lim_{n \to \infty} I_n(\theta)$ exists
(though it is not necessarily nonsingular). Thus, Condition 2
is satisfied.

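Finally we check Condition 3. The Yatracos class $\mathcal{A}_n$
consists of all sets of the form

$$A_{\theta,\theta'} = \left\{ x^n \in \mathcal{X}^n : \sum_{s^n \in \mathcal{S}^n} \left[ \prod_{i=1}^n a_{s_{i-1} s_i} - \prod_{i=1}^n a'_{s_{i-1} s_i} \right] \times \prod_{j=1}^n p(x_j | s_j) > 0 \right\}$$

for all $\theta = [a_{ij}]$, $\theta' = [a'_{ij}] \in \Lambda$. The condition $x^n \in A_{\theta,\theta'}$ can
be written as $\Pi(x^n, \theta, \theta') > 0$, where for each $x^n$, $\Pi(x^n; \theta, \theta')$
is a polynomial of degree $n$ in the $2M^2$ parameters $a_{ij}, a'_{kl}$,
$1 \le i, j, k, l \le M$. Thus, Lemma 2.2 implies that $V(\mathcal{A}_n) \le
4M^2\log(4en)$, so Condition 3 is satisfied as well.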

VI. Conclusions and future directions 

We have shown that, given a parametric family of stationary 
mixing sources satisfying some regularity conditions, there ex- 
ists a universal scheme for joint lossy compression and source 
identification, with the nth-order Lagrangian redundancy and 
the variational distance between n-dimensional marginals of 
the true and the estimated source both converging to zero as
$\sqrt{V_n \log n/n}$, as the block length $n$ tends to infinity. The
sequence $\{V_n\}$ quantifies the learnability of the $n$-dimensional
marginals. This generalizes our previous results from [9], [10] 
for i.i.d. sources. 

We can outline some directions for future research. 

• Both in our earlier work [9], [10] and in the present paper, 
we assume that the dimension of the parameter space 
is known a priori. It would be of interest to consider 
the case when the parameter space is finite-dimensional, 
but its dimension is not known. Thus, we would have
a hierarchical model class $\bigcup_{k=1}^{\infty} \{P_\theta : \theta \in \Lambda^{(k)}\}$, where,
for each $k$, $\Lambda^{(k)}$ is an open subset of $\mathbb{R}^k$, and we could use
a complexity regularization technique, such as "structural
risk minimization" (see, e.g., Lugosi and Zeger [44] or 
Chapter 6 of Vapnik [22]), to adaptively trade off the 
estimation and the approximation errors. 

• The minimum-distance density estimator of Devroye and 
Lugosi [34], [35], which plays the key role in our scheme 
both here and in [9], [10], is not easy to implement in 
practice, especially for multidimensional alphabets. On 
the other hand, there are two-stage universal schemes, 
such as that of Chou, Effros and Gray [14], which do 
not require memory and select the second-stage code 
based on pointwise, rather than average, behavior of the 
source. These schemes, however, are geared toward com- 
pression, and do not emphasize identification. It would be 
worthwhile to devise practically implementable universal 
schemes that strike a reasonable compromise between 
these two objectives. 

• Finally, neither here nor in our earlier work [9], [10] 
have we considered the issues of optimality. It would 
be of interest to obtain lower bounds on the performance 
of any universal scheme for joint lossy compression and 
identification, say, in the spirit of minimax lower bounds 
in statistical learning theory (cf., e.g., Chapter 14 of 
Devroye, Gyorfi and Lugosi [21]). 



Conceptually, our results indicate that links between sta- 
tistical modeling (parameter estimation) and universal source 
coding, exploited in the lossless case by Rissanen [2], [3], are 
present in the domain of lossy coding as well. We should also 
mention that another modeling-based approach to universal 
lossy source coding, due to Kontoyiannis and others (see, 
e.g., Madiman and Kontoyiannis [45] and references therein), 
treats code selection as a statistical estimation problem over 
a class of model distributions in the reproduction space. 
This approach, while closer in spirit to Rissanen's Minimum 
Description Length (MDL) principle [46], does not address 
the problem of joint source coding and identification, but 
it provides a complementary perspective on the connections 
between lossy source coding and statistical modeling. 

Appendix 

In this Appendix, we detail some properties of Lagrange- 
optimal variable-rate vector quantizers. Our exposition is pat- 
terned on the work of Linder [19], with appropriate modifica- 
tions. 

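As elsewhere in the paper, let $\mathcal{X}$ be the source alphabet
and $\hat{\mathcal X}$ the reproduction alphabet, both assumed to be Polish
spaces. As before, let the distortion function $\rho$ be induced by
a $\rho_{\max}$-bounded metric on a Polish metric space $\mathcal{Y}$ containing
$\mathcal{X} \cup \hat{\mathcal X}$. For every $n = 1, 2, \ldots$, define the metric $\rho_n$ on $\mathcal{Y}^n$
by

$$\rho_n(y^n, u^n) = \frac{1}{n}\sum_{i=1}^n \rho(y_i, u_i).$$

For any pair $P^{(1)}, P^{(2)}$ of probability measures on $\mathcal{Y}^n$, let
$\mathcal{P}_n(P^{(1)}, P^{(2)})$ be the set of all probability measures on
$\mathcal{Y}^n \times \mathcal{Y}^n$ having $P^{(1)}$ and $P^{(2)}$ as marginals, and define the
Wasserstein metric

$$\bar\rho_n(P^{(1)}, P^{(2)}) = \inf_{P \in \mathcal{P}_n(P^{(1)}, P^{(2)})} E_P\{\rho_n(X^n, Y^n)\} = \inf_{P \in \mathcal{P}_n(P^{(1)}, P^{(2)})} \int \rho_n(x^n, y^n)\, dP(x^n, y^n).$$

(See Gray, Neuhoff and Shields [47] for more details and
applications.) Note that, because $\rho$ is a bounded metric,

$$\int \rho_n(x^n, y^n)\, dP(x^n, y^n) \le \rho_{\max} \int 1_{\{x^n \ne y^n\}}\, dP(x^n, y^n)$$

for all $P \in \mathcal{P}_n(P^{(1)}, P^{(2)})$. Taking the infimum of both sides
over all $P \in \mathcal{P}_n(P^{(1)}, P^{(2)})$ and observing that

$$\inf_{P \in \mathcal{P}_n(P^{(1)}, P^{(2)})} \int 1_{\{x^n \ne y^n\}}\, dP(x^n, y^n) = \frac{1}{2}\, d(P^{(1)}, P^{(2)}),$$

see, e.g., Section I.5 of Lindvall [48], we get the useful bound

$$\bar\rho_n(P^{(1)}, P^{(2)}) \le \frac{1}{2}\rho_{\max}\, d(P^{(1)}, P^{(2)}). \qquad (A.1)$$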

Now, for each $n$, let $\mathcal{M}_n$ denote the set of all discrete
probability distributions on $\hat{\mathcal X}^n$ with finite entropy. That is,
$Q \in \mathcal{M}_n$ if and only if it is concentrated on a finite or a
countable set $\{y_i\}_{i \in I_Q} \subset \hat{\mathcal X}^n$, and

$$H(Q) = -\sum_{i \in I_Q} Q(y_i)\log Q(y_i) < \infty.$$

For every $Q \in \mathcal{M}_n$, consider the set $\mathcal{C}(Q)$ of all one-to-one
maps $c : I_Q \to \{0,1\}^*$, such that, for each $c \in \mathcal{C}(Q)$, the
collection $\{c(i)\}_{i \in I_Q}$ satisfies the Kraft inequality, and let

$$\ell_Q = \min_{c \in \mathcal{C}(Q)} \sum_{i \in I_Q} \ell(c(i))\, Q(y_i)$$

be the minimum expected code length. Since the entropy
of $Q$ is finite, there is always a minimizing $c^*_Q$, and the
Shannon-Fano bound (see Section 5.4 of Cover and Thomas
[1]) guarantees that $\ell_Q \le H(Q) + 1 < \infty$.

Now, for any $\lambda > 0$, any probability distribution $P$ on $\mathcal{X}^n$,
and any $Q \in \mathcal{M}_n$, define

$$L_n(P, Q; \lambda) = \bar\rho_n(P, Q) + \lambda n^{-1}\ell_Q.$$

To give an intuitive meaning to $L_n(P, Q; \lambda)$, let $X^n$ and $Y$ be
jointly distributed random variables with $X^n \sim P$ and $Y \sim Q$,
such that their joint distribution $\tilde P \in \mathcal{P}_n(P, Q)$ achieves
$\bar\rho_n(P, Q)$. Then $L_n(P, Q; \lambda)$ is the expected Lagrangian
performance, at Lagrange multiplier $\lambda$, of a stochastic variable-rate
quantizer which encodes each point $x^n \in \mathcal{X}^n$ as a binary
codeword with length $\ell(c^*_Q(i))$ and decodes it to $y_i$ in the support
of $Q$ with probability $\tilde P(Y = y_i \,|\, X^n = x^n)$.

The following lemma shows that deterministic quantizers 
are as good as random ones: 

Lemma A. 1: Let Lp(C n ,X) be the expected Lagrangian 
performance of an n-block variable rate quantizer operating 
on X n ~ P, and let L p (X) be the expected Lagrangian 
performance, with respect to P, of the best n-block variable- 
rate quantizer. Then 

L P (X) = inf L n {P,Q;X). 
QeM n 

Proof: Consider any quantizer C" = (/, <p) with 
L P (C n ,X) < oo. Let Q C " be the distribution of C n (X n ). 
Clearly, Q c ™ 6 -M„, and 

MC"\A) = ^{ P n(X n ,C n (X n ))} + XE{l n (f(X n ))} 
> p n {P,Q c ™) + Xn-% cn 
= L n (P,Q C n;X). 

Hence, $\widehat{L}_P(\lambda) \ge \inf_{Q \in \mathcal{M}_n} L_n(P,Q;\lambda)$. To prove the reverse
inequality, suppose that $X^n \sim P$ and $Y \sim Q$ achieve
$\bar\rho_n(P,Q)$ for some $Q \in \mathcal{M}_n$. Let $\bar P$ be their joint distribution.
Let $\{y_i\}_{i \in I_Q} \subset \mathcal{X}^n$ be the support of $Q$, let $c^*_Q : I_Q \to
\{0,1\}^*$ achieve $\ell_Q$, and let $S = \{c^*_Q(i)\}_{i \in I_Q}$ be the associated
binary code. Define the quantizer $C^n = (f,\varphi)$ by

$$\varphi(s) = y_i \quad \text{if } s = c^*_Q(i)$$

and

$$f(x^n) = \arg\min_{s \in S} \big( \rho_n(x^n, \varphi(s)) + \lambda \ell_n(s) \big).$$


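Then

$$L_P(C^n,\lambda) = E_P\Big\{ \min_{s \in S} \big( \rho_n(X^n, \varphi(s)) + \lambda \ell_n(s) \big) \Big\}.$$

On the other hand,

$$\begin{aligned}
L_n(P,Q;\lambda) &= E_{\bar P}\{ \rho_n(X^n, Y) + \lambda n^{-1} \ell_Q \} \\
&= \int dP(x^n) \sum_{i \in I_Q} \big( \rho_n(x^n, y_i) + \lambda \ell_n(c^*_Q(i)) \big)\, \bar P(Y = y_i \mid X^n = x^n) \\
&\ge \int dP(x^n) \min_{i \in I_Q} \big( \rho_n(x^n, y_i) + \lambda \ell_n(c^*_Q(i)) \big) \\
&= \int dP(x^n) \min_{s \in S} \big( \rho_n(x^n, \varphi(s)) + \lambda \ell_n(s) \big) \\
&= L_P(C^n,\lambda),
\end{aligned}$$

so that $\inf_{Q \in \mathcal{M}_n} L_n(P,Q;\lambda) \ge \widehat{L}_P(\lambda)$, and the lemma is
proved. ■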
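
The key inequality in the second half of the proof (an average over a conditional distribution is at least the pointwise minimum) can be illustrated on a toy example. The following is a minimal sketch under stated assumptions: a scalar source with $n = 1$, absolute-error distortion, a hypothetical three-point reproduction set with prefix-code lengths 1, 2, 2, and an arbitrary randomly generated conditional distribution standing in for $\bar P(Y = y_i \mid X^n = x^n)$. It does not compute $\bar\rho_n$ or an optimal code; it only compares a randomized assignment with the deterministic Lagrangian-minimizing encoder for the same codebook.

```python
import random

def randomized_lagrangian(p, cond, dist, lengths, lam, n=1):
    """Expected Lagrangian when source point x is sent to index i w.p. cond[x][i]."""
    return sum(
        p[x] * cond[x][i] * (dist[x][i] + lam * lengths[i] / n)
        for x in range(len(p))
        for i in range(len(lengths))
    )

def deterministic_lagrangian(p, dist, lengths, lam, n=1):
    """Expected Lagrangian of the encoder minimizing distortion + lam * rate."""
    return sum(
        p[x] * min(dist[x][i] + lam * lengths[i] / n for i in range(len(lengths)))
        for x in range(len(p))
    )

# Toy setup (all values are illustrative assumptions).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]      # source points
p = [0.2] * 5                          # source distribution P
ys = [0.2, 0.8, 0.5]                   # reproduction points y_i
lengths = [1, 2, 2]                    # lengths of the prefix code {0, 10, 11}
lam = 0.1                              # Lagrange multiplier
dist = [[abs(x - y) for y in ys] for x in xs]

# Arbitrary conditional distribution over reproduction indices for each x.
rng = random.Random(0)
cond = []
for _ in xs:
    w = [rng.random() for _ in ys]
    total = sum(w)
    cond.append([v / total for v in w])

rand_value = randomized_lagrangian(p, cond, dist, lengths, lam)
det_value = deterministic_lagrangian(p, dist, lengths, lam)
print("randomized:", rand_value, "deterministic:", det_value)
```

Whatever conditional distribution is supplied, the deterministic value is never larger, which is the sense in which deterministic quantizers are as good as random ones.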

The following lemma gives a useful upper bound on the 
Lagrangian mismatch: 

Lemma A.2: Let $P, P'$ be probability distributions on $\mathcal{X}^n$.
Then

$$\big| \widehat{L}_P(\lambda) - \widehat{L}_{P'}(\lambda) \big| \le \frac{1}{2} \rho_{\max}\, d(P,P').$$



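Proof: Suppose $\widehat{L}_P(\lambda) \ge \widehat{L}_{P'}(\lambda)$. Let $Q'$ achieve
$\inf_{Q \in \mathcal{M}_n} L_n(P',Q;\lambda)$ (or be arbitrarily close). Then

$$\begin{aligned}
\widehat{L}_P(\lambda) - \widehat{L}_{P'}(\lambda)
&\overset{(a)}{=} \inf_{Q \in \mathcal{M}_n} L_n(P,Q;\lambda) - \inf_{Q \in \mathcal{M}_n} L_n(P',Q;\lambda) \\
&= \inf_{Q \in \mathcal{M}_n} L_n(P,Q;\lambda) - L_n(P',Q';\lambda) \\
&\le L_n(P,Q';\lambda) - L_n(P',Q';\lambda) \\
&\overset{(b)}{=} \bar\rho_n(P,Q') + \lambda n^{-1} \ell_{Q'} - \bar\rho_n(P',Q') - \lambda n^{-1} \ell_{Q'} \\
&= \bar\rho_n(P,Q') - \bar\rho_n(P',Q') \\
&\overset{(c)}{\le} \bar\rho_n(P,P') \\
&\overset{(d)}{\le} \frac{1}{2} \rho_{\max}\, d(P,P'),
\end{aligned}$$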

where in (a) we used Lemma A.1, in (b) we used the definition
of $L_n(\cdot, Q'; \lambda)$, in (c) we used the fact that $\bar\rho_n$ is a metric and
the triangle inequality, and in (d) we used the bound (A.1). ■
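
Step (c) can be unpacked in one line. As a restatement of the argument above (no additional assumption is involved), the triangle inequality for the metric $\bar\rho_n$, applied to $P$, $P'$, and $Q'$, gives

$$\bar\rho_n(P,Q') \le \bar\rho_n(P,P') + \bar\rho_n(P',Q') \quad \Longrightarrow \quad \bar\rho_n(P,Q') - \bar\rho_n(P',Q') \le \bar\rho_n(P,P').$$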

Finally, the lemma below shows that, for bounded distortion 
functions, Lagrange-optimal quantizers have finite codebooks: 

Lemma A.3: For positive integers $N, L$, let $\mathcal{Q}_n(N,L)$ denote
the set of all zero-memory variable-rate quantizers with
block length $n$, such that for every $C^n \in \mathcal{Q}_n(N,L)$, the
associated binary code $S$ of $C^n$ satisfies $|S| \le N$ and
$\ell(s) \le L$ for every $s \in S$. Let $P$ be a probability distribution
on $\mathcal{X}^n$. Then

$$\widehat{L}_P(\lambda) = \inf_{C^n \in \mathcal{Q}_n(N,L)} L_P(C^n,\lambda),$$

with $N \le 2^{2n\rho_{\max}/\lambda}$ and $L \le 2n\rho_{\max}/\lambda$.

Proof: Let $C^n_*$ with encoder $f_* : \mathcal{X}^n \to S$ and decoder
$\varphi_* : S \to \mathcal{X}^n$ achieve the $n$th-order optimum $\widehat{L}_P(\lambda)$ for $P$.
Let $s_0 \in S$ be the shortest binary string in $S$, i.e.,

$$\ell(s_0) = \min_{s \in S} \ell(s).$$



Without loss of generality, we can take $f_*$ as the minimum-distortion
encoder, i.e.,

$$f_*(x^n) = \arg\min_{s \in S} \big( \rho_n(x^n, \varphi_*(s)) + \lambda \ell_n(s) \big).$$

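Thus, for any $s \in S$ and any $x^n \in f_*^{-1}(s)$,

$$\rho_n(x^n, \varphi_*(s)) + \lambda \ell_n(s) \le \rho_n(x^n, \varphi_*(s_0)) + \lambda \ell_n(s_0).$$

Since $0 \le \rho_n \le \rho_{\max}$, it follows that $\ell(s) \le n\rho_{\max}/\lambda + \ell(s_0)$ for all $s \in S$. Furthermore,
$L_P(C^n_*,\lambda) \ge \lambda E_P\{\ell_n(f_*(X^n))\} \ge \lambda \ell_n(s_0)$.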

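Now pick an arbitrary reproduction string $x^n_0 \in \mathcal{X}^n$,
let $\varepsilon$ be the empty binary string (of length zero), and let
$C^n_0$ be the zero-rate quantizer with the constant encoder
$f_0(x^n) = \varepsilon$ and the decoder $\varphi_0(\varepsilon) = x^n_0$. Then $L_P(C^n_0,\lambda) =
E_P\{\rho_n(X^n, x^n_0)\} + \lambda \ell_n(\varepsilon) \le \rho_{\max}$. On the other hand,
$L_P(C^n_*,\lambda) \le L_P(C^n_0,\lambda)$. Therefore,

$$\lambda \ell_n(s_0) \le L_P(C^n_*,\lambda) \le L_P(C^n_0,\lambda) \le \rho_{\max},$$

so that $\ell(s_0) \le n\rho_{\max}/\lambda$. Hence,

$$\ell(s) \le 2n\rho_{\max}/\lambda, \quad \forall s \in S.$$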

Since the strings in $S$ must satisfy Kraft's inequality, we have

$$1 \ge \sum_{s \in S} 2^{-\ell(s)} \ge |S|\, 2^{-2n\rho_{\max}/\lambda},$$

which implies that $|S| \le 2^{2n\rho_{\max}/\lambda}$. ■
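
The first step of this proof, that a Lagrangian-minimizing encoder never selects a codeword longer than $n\rho_{\max}/\lambda + \ell(s_0)$, can be observed on a toy codebook. The sketch below is illustrative only and rests on assumptions not in the text: a scalar source on $[0,1]$ with $n = 1$, absolute-error distortion (so $\rho_{\max} = 1$), randomly placed reproduction points, and a hypothetical set of codeword lengths whose Kraft sum equals one. It does not verify the second step, $\ell(s_0) \le n\rho_{\max}/\lambda$, which relies on optimality of the quantizer.

```python
import random

def selected_indices(xs, ys, lengths, lam, n=1):
    """Indices chosen at least once by the Lagrangian-minimizing encoder."""
    chosen = set()
    for x in xs:
        costs = [abs(x - y) + lam * l / n for y, l in zip(ys, lengths)]
        chosen.add(costs.index(min(costs)))
    return chosen

# Toy parameters (all values are illustrative assumptions).
rho_max, n, lam = 1.0, 1, 0.5
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 8]      # codeword lengths, Kraft sum equals 1
rng = random.Random(1)
ys = [rng.random() for _ in lengths]       # reproduction points in [0, 1)
xs = [k / 200 for k in range(201)]         # grid of source points in [0, 1]

chosen = selected_indices(xs, ys, lengths, lam, n)
length_cap = n * rho_max / lam + min(lengths)
longest_used = max(lengths[i] for i in chosen)
print("longest codeword used:", longest_used, "cap:", length_cap)
```

Codewords whose length exceeds the cap are never selected, so removing them leaves the Lagrangian performance unchanged; this is the pruning that makes the codebook finite.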